philiprehberger-web-scraper

Lightweight web scraper with rate limiting and CSS selectors.

Install

pip install philiprehberger-web-scraper

Usage

from philiprehberger_web_scraper import Scraper

scraper = Scraper(rate_limit=2.0, retry_attempts=3)

# Fetch a single page
page = scraper.get("https://example.com")
titles = page.select_all("h2.title")
link = page.select_one("a.next")
all_links = page.links()

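# Extract data into a list of records
# (assumption: the export helpers below accept a list of dicts)
data = []
for el in page.select_all(".product"):
    data.append({
        "name": el.select_one(".name").text,
        "url": el.select_one("a").attr("href"),
    })

# Crawl multiple pages (a separate loop variable keeps `page` above intact)
for crawled_page in scraper.crawl("https://example.com/blog", max_pages=20):
    for article in crawled_page.select_all("article"):
        print(article.select_one("h2").text)

# Export the collected records
Scraper.export_csv(data, "output.csv")
Scraper.export_json(data, "output.json")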
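
The crawl call above is described as staying on the same domain. The package's internals are not shown here, so the following is only a sketch of that idea using the standard library. `same_domain_links` is a hypothetical helper, not part of the package, and dropping fragments and duplicates is an assumption about what a crawler would want.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep http(s) links on the same host.

    Hypothetical helper for illustration; not the package's API.
    """
    base_host = urlparse(base_url).netloc.lower()
    seen = set()
    kept = []
    for href in hrefs:
        # Turn relative links into absolute URLs
        parsed = urlparse(urljoin(base_url, href))
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc.lower() != base_host:
            continue  # skip links that leave the starting host
        # Drop the fragment so "#top" variants count as the same page
        cleaned = parsed._replace(fragment="").geturl()
        if cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    return kept
```

Comparing the full host is the strictest reading of "same domain". A crawler that should also follow subdomains would need a looser comparison.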

Features

  • Built-in rate limiting (token bucket)
  • Retry with backoff on 429/5xx errors
  • CSS selector API wrapping BeautifulSoup
  • Crawl mode with same-domain filtering
  • Link and image extraction
  • CSV and JSON export helpers
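
The rate limiter is listed as a token bucket. As a rough illustration of that technique (not the package's actual code), the sketch below refills tokens at a fixed rate and lets a request through only when a whole token is available. The class name `TokenBucket`, its methods, and the injectable `clock` parameter are all assumptions made for this example.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`.

    Illustrative sketch only; names and defaults are assumptions.
    """

    def __init__(self, rate, capacity=1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so the first request is immediate
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last update, capped at capacity
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return True on success."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate
```

With `rate=2.0`, a caller that has just spent its token would sleep for `wait_time()` (half a second) before the next request. That matches a limit of two requests per second.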

Options

Scraper(
    rate_limit=2.0,        # max requests/second
    retry_attempts=3,      # retries on failure
    retry_delay=1.0,       # base delay between retries
    timeout=30.0,          # request timeout
    headers={...},         # custom headers
)
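
To make the interplay of `retry_attempts` and `retry_delay` concrete, here is a small sketch of retrying on 429/5xx responses with a growing delay. The README says only "backoff", so doubling the delay on each attempt is an assumption. `should_retry`, `backoff_delays`, and `call_with_retry` are hypothetical names, and `send` stands in for whatever performs the request and returns a status code.

```python
import time


def should_retry(status):
    """Retry on 429 (Too Many Requests) and any 5xx server error."""
    return status == 429 or 500 <= status <= 599


def backoff_delays(retry_attempts, retry_delay):
    """Delays before each retry, doubling from the base delay (assumed policy)."""
    return [retry_delay * (2 ** i) for i in range(retry_attempts)]


def call_with_retry(send, retry_attempts=3, retry_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out.

    `send` is a hypothetical callable that returns an HTTP status code.
    """
    status = send()
    for delay in backoff_delays(retry_attempts, retry_delay):
        if not should_retry(status):
            break
        sleep(delay)
        status = send()
    return status
```

Under these assumptions, `retry_attempts=3` with `retry_delay=1.0` waits 1, 2, and then 4 seconds before giving up and returning the last status.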

Development

pip install -e .
python -m pytest tests/ -v

License

MIT
