
wscrape

Fast async web scraper — CLI tool, Python library, and AI agent skill.

wscrape is a production-ready Python package that scrapes websites, optionally crawls links recursively, and returns structured data suitable for AI consumption.

Installation

pip install wscrape

For development:

pip install "wscrape[dev]"

CLI Usage

# Scrape a single page
wscrape https://example.com

# Deep crawl with depth limit
wscrape https://example.com --deep --max-depth 2

# Save results to a file
wscrape https://example.com --deep --save --output data.json

# CSV output
wscrape https://example.com --format csv --save --output data.csv

# Custom rate limiting (milliseconds between requests)
wscrape https://example.com --deep --rate 500

CLI Options

| Option | Description | Default |
|---|---|---|
| `<url>` | Target URL to scrape | (required) |
| `--deep` | Recursively crawl same-domain links | False |
| `--save` | Save output to a file | False |
| `--output <file>` | Output file path | `output.json` |
| `--format <json\|csv>` | Output format | `json` |
| `--max-depth <n>` | Maximum recursion depth | 1 |
| `--rate <ms>` | Delay between requests in ms | 200 |
| `--verbose` / `-v` | Enable debug logging | False |
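If you prefer to drive the CLI from a script, the options above compose into a plain argv list. A minimal sketch — the `build_cmd` helper is illustrative, not part of wscrape:

```python
import subprocess

def build_cmd(url, deep=False, max_depth=1, rate=200,
              fmt="json", save=False, output=None, verbose=False):
    """Assemble a wscrape argv list from the CLI options table (hypothetical helper)."""
    cmd = ["wscrape", url]
    if deep:
        cmd += ["--deep", "--max-depth", str(max_depth)]
    if rate != 200:
        cmd += ["--rate", str(rate)]
    if fmt != "json":
        cmd += ["--format", fmt]
    if save:
        cmd.append("--save")
        if output:
            cmd += ["--output", output]
    if verbose:
        cmd.append("-v")
    return cmd

cmd = build_cmd("https://example.com", deep=True, max_depth=2,
                fmt="csv", save=True, output="data.csv")
# subprocess.run(cmd, check=True)  # uncomment to actually invoke wscrape
print(" ".join(cmd))
```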

Python Library Usage

from wscrape import run_scraper

# Synchronous — works everywhere
results = run_scraper("https://example.com", deep=True, max_depth=2)
for page in results:
    print(page["url"], page["title"])

Async Usage

import asyncio
from wscrape import async_run_scraper

results = asyncio.run(
    async_run_scraper("https://example.com", deep=True, max_depth=2)
)
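Because `async_run_scraper` is a coroutine, several crawls can be fanned out concurrently with `asyncio.gather`. The sketch below uses a stub coroutine in place of `async_run_scraper` so it runs without network access; swap the stub for the real import in practice:

```python
import asyncio

async def stub_scrape(url, deep=False, max_depth=1):
    """Stand-in for wscrape.async_run_scraper -- returns one fake page dict."""
    await asyncio.sleep(0)  # yield control, as a real HTTP fetch would
    return [{"url": url, "title": "stub", "text": "", "links": [], "depth": 0}]

async def scrape_many(urls):
    # Launch all crawls at once; gather preserves input order in its results.
    batches = await asyncio.gather(*(stub_scrape(u, deep=True) for u in urls))
    # Flatten the per-site result lists into one list of page dicts.
    return [page for batch in batches for page in batch]

pages = asyncio.run(scrape_many(["https://example.com", "https://example.org"]))
print([p["url"] for p in pages])
```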

AI Agent Skill

wscrape is designed to be called programmatically by AI agents (Claude, OpenAI, etc.):

from wscrape import run_scraper

# Returns list[dict] with keys: url, title, text, links, depth
data = run_scraper("https://example.com")
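For agent use, the raw page dicts usually need trimming to fit a context window. A hedged sketch of a thin tool wrapper — the `scrape_tool` name and `max_chars` budget are illustrative, not part of wscrape:

```python
def scrape_tool(pages, max_chars=2000):
    """Trim scraped page dicts (url, title, text, links, depth) for an LLM context.

    `pages` is the list[dict] that run_scraper returns; this helper is
    an example wrapper, not part of wscrape itself.
    """
    trimmed = []
    for page in pages:
        trimmed.append({
            "url": page["url"],
            "title": page["title"],
            "text": page["text"][:max_chars],  # hard character budget per page
            "n_links": len(page["links"]),     # counts are cheaper than full URL lists
        })
    return trimmed

sample = [{"url": "https://example.com", "title": "Example Domain",
           "text": "x" * 5000, "links": ["https://www.iana.org/domains/example"],
           "depth": 0}]
print(scrape_tool(sample, max_chars=100))
```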

Output Format

Each scraped page produces:

{
  "url": "https://example.com",
  "title": "Example Domain",
  "text": "Example Domain This domain is for use in illustrative examples ...",
  "links": ["https://www.iana.org/domains/example"],
  "depth": 0
}
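Since every page carries `url` and `links`, a crawl result converts directly into an adjacency map for link analysis. A small sketch over a hand-built sample shaped like the output above:

```python
def link_graph(pages):
    """Map each scraped page's URL to the set of URLs it links to."""
    return {page["url"]: set(page["links"]) for page in pages}

pages = [
    {"url": "https://example.com", "title": "Example Domain",
     "text": "...", "links": ["https://www.iana.org/domains/example"], "depth": 0},
]
graph = link_graph(pages)
print(graph["https://example.com"])
```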

Architecture

wscrape/
  __init__.py      # Public API & AI skill interface (run_scraper, async_run_scraper)
  __main__.py      # python -m wscrape support
  cli.py           # CLI entry point (argparse)
  config.py        # ScrapeConfig dataclass
  exceptions.py    # Custom exception hierarchy
  core/
    fetcher.py     # Async HTTP client with retries & backoff
    parser.py      # HTML parsing (lxml) — title, text, links, CSS, XPath
    crawler.py     # BFS crawler with rate limiting & same-domain policy
    output.py      # JSON / CSV serialisation & file output
tests/
  test_fetcher.py  # Fetcher unit tests (mocked HTTP)
  test_parser.py   # Parser unit tests
  test_crawler.py  # Crawler unit tests (mocked HTTP)
  test_cli.py      # CLI integration tests

Development

git clone https://github.com/wscrape/wscrape.git
cd wscrape
pip install -e ".[dev]"
pytest

License

MIT
