
Project description

requests-crawler

A web crawler based on requests-html, mainly intended for URL validation testing.

Features

  • based on requests-html, with full JavaScript support (see the sketch after this list)
  • supports request frequency limits, e.g. RPS/RPM
  • supports crawling with custom headers and cookies
  • include & exclude mechanisms for URL filtering
  • groups visited URLs by HTTP status code
  • displays each URL's referers and hyperlinks
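
The link extraction and JavaScript rendering come from the requests-html library itself. Below is a minimal sketch of that library's API, not this crawler's internal code, fetching a seed page, optionally rendering its JavaScript, and collecting the absolute links a crawler would follow:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("http://debugtalk.com")  # example seed page

# Render JavaScript (downloads Chromium on first use); optional for static pages.
r.html.render()

# Absolute hyperlinks found on the page; a crawler enqueues these and records
# each URL's HTTP status code for the validation report.
print(r.status_code)
for link in r.html.absolute_links:
    print(link)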

Installation/Upgrade

$ pip install requests-crawler

Only Python 3.6 is supported.

To confirm that the installation or upgrade succeeded, run requests_crawler -V and check that the expected version number is printed.

$ requests_crawler -V
0.5.3

Usage

$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL]
                        [--seed SEED]
                        [--headers [HEADERS [HEADERS ...]]]
                        [--cookies [COOKIES [COOKIES ...]]]
                        [--requests-limit REQUESTS_LIMIT]
                        [--interval-limit INTERVAL_LIMIT]
                        [--include [INCLUDE [INCLUDE ...]]]
                        [--exclude [EXCLUDE [EXCLUDE ...]]]
                        [--workers WORKERS]

A web crawler based on requests-html, mainly targets for url validation test.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version
  --log-level LOG_LEVEL
                        Specify logging level, default is INFO.
  --seed SEED           Specify crawl seed url
  --headers [HEADERS [HEADERS ...]]
                        Specify headers, e.g. 'User-Agent:iOS/10.3'
  --cookies [COOKIES [COOKIES ...]]
                        Specify cookies, e.g. 'lang=en country:us'
  --requests-limit REQUESTS_LIMIT
                        Specify requests limit for crawler, default rps.
  --interval-limit INTERVAL_LIMIT
                        Specify limit interval, default 1 second.
  --include [INCLUDE [INCLUDE ...]]
                        Urls include the snippets will be crawled recursively.
  --exclude [EXCLUDE [EXCLUDE ...]]
                        Urls include the snippets will be skipped.
  --workers WORKERS     Specify concurrent workers number.

Examples

Basic usage.

$ requests_crawler --seed http://debugtalk.com

Crawl with headers and cookies.

$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 --cookies lang:en country:us
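
Header and cookie arguments are given as 'key:value' snippets; presumably the crawler parses them into plain mappings for the underlying requests-html session. The direct equivalent, written out by hand (the parsing into dicts is an assumption, not documented behavior), would look like:

from requests_html import HTMLSession

session = HTMLSession()
# Assumed interpretation of: --headers User-Agent:iOS/10.3 --cookies lang:en country:us
headers = {"User-Agent": "iOS/10.3"}
cookies = {"lang": "en", "country": "us"}
r = session.get("http://debugtalk.com", headers=headers, cookies=cookies)
print(r.status_code)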

Crawl with a 30 RPS (requests per second) limit.

$ requests_crawler --seed http://debugtalk.com --requests-limit 30

Crawl with a 500 RPM (requests per minute) limit, i.e. at most 500 requests per 60-second interval.

$ requests_crawler --seed http://debugtalk.com --requests-limit 500 --interval-limit 60
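
In other words, --requests-limit N together with --interval-limit T caps the crawler at N requests per T seconds; T defaults to 1, which is why a bare --requests-limit acts as an RPS limit. A simplified, hypothetical fixed-window throttle illustrating this semantics (not the crawler's actual implementation):

import time

REQUESTS_LIMIT = 500   # --requests-limit
INTERVAL_LIMIT = 60    # --interval-limit, in seconds

window_start = time.monotonic()
sent = 0

def wait_for_slot():
    """Block until another request may be sent within the current window."""
    global window_start, sent
    if time.monotonic() - window_start >= INTERVAL_LIMIT:
        window_start, sent = time.monotonic(), 0   # start a new window
    if sent >= REQUESTS_LIMIT:
        remaining = INTERVAL_LIMIT - (time.monotonic() - window_start)
        if remaining > 0:
            time.sleep(remaining)                  # wait out the window
        window_start, sent = time.monotonic(), 0
    sent += 1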

Crawl extra included hosts; e.g. URLs containing httprunner.org will also be crawled recursively.

$ requests_crawler --seed http://debugtalk.com --include httprunner.org

Skip URLs matching excluded snippets; e.g. URLs containing httprunner will be skipped.

$ requests_crawler --seed http://debugtalk.com --exclude httprunner

Download files


Files for requests-crawler, version 0.5.4:

  • requests_crawler-0.5.4-py2.py3-none-any.whl (20.6 kB), wheel, py2.py3
  • requests-crawler-0.5.4.tar.gz (11.3 kB), source distribution
