
Project description


A web crawler based on requests-html, mainly targeted at URL validation testing.


  • based on requests-html, with full JavaScript support
  • supports request frequency limits, e.g. RPS/RPM
  • supports crawling with custom headers and cookies
  • include & exclude mechanism for URL filtering
  • groups visited URLs by HTTP status code
  • displays each URL's referers and hyperlinks
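As a rough illustration of the last grouping feature, visited URLs can be bucketed by their HTTP status code; the sketch below assumes a hypothetical list of (url, status) pairs and does not reflect the crawler's actual internal data structures.

```python
from collections import defaultdict

def group_by_status(results):
    """Group (url, status_code) pairs by HTTP status code.

    `results` is a hypothetical list of (url, status) tuples;
    the real crawler's internals may differ.
    """
    grouped = defaultdict(list)
    for url, status in results:
        grouped[status].append(url)
    return dict(grouped)
```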


$ pip install requests-crawler

Only Python 3.6 is supported.

To verify that the installation or upgrade succeeded, run requests_crawler -V and check that it prints the expected version number.

$ requests_crawler -V


$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL]
                        [--seed SEED]
                        [--headers [HEADERS [HEADERS ...]]]
                        [--cookies [COOKIES [COOKIES ...]]]
                        [--requests-limit REQUESTS_LIMIT]
                        [--interval-limit INTERVAL_LIMIT]
                        [--include [INCLUDE [INCLUDE ...]]]
                        [--exclude [EXCLUDE [EXCLUDE ...]]]
                        [--workers WORKERS]

A web crawler based on requests-html, mainly targets for url validation test.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version
  --log-level LOG_LEVEL
                        Specify logging level, default is INFO.
  --seed SEED           Specify crawl seed url
  --headers [HEADERS [HEADERS ...]]
                        Specify headers, e.g. 'User-Agent:iOS/10.3'
  --cookies [COOKIES [COOKIES ...]]
                        Specify cookies, e.g. 'lang=en country:us'
  --requests-limit REQUESTS_LIMIT
                        Specify requests limit for crawler, default rps.
  --interval-limit INTERVAL_LIMIT
                        Specify limit interval, default 1 second.
  --include [INCLUDE [INCLUDE ...]]
                        Urls include the snippets will be crawled recursively.
  --exclude [EXCLUDE [EXCLUDE ...]]
                        Urls include the snippets will be skipped.
  --workers WORKERS     Specify concurrent workers number.


Basic usage.

$ requests_crawler --seed <seed_url>

Crawl with headers and cookies.

$ requests_crawler --seed <seed_url> --headers User-Agent:iOS/10.3 --cookies lang:en country:us

Crawl with a 30 RPS limit.

$ requests_crawler --seed <seed_url> --requests-limit 30

Crawl with a 500 RPM limit (500 requests per 60-second interval).

$ requests_crawler --seed <seed_url> --requests-limit 500 --interval-limit 60
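The two flags together define a rate of requests-limit per interval-limit seconds (so 500 with interval 60 is roughly 8.3 RPS). A minimal sketch of this windowed throttling idea, assuming a simple fixed-window scheme that may differ from the crawler's real implementation:

```python
import time

class IntervalThrottle:
    """Allow at most `limit` requests per `interval` seconds.

    A simplified sketch of the idea behind --requests-limit and
    --interval-limit; the crawler's actual throttling may differ.
    """
    def __init__(self, limit, interval=1.0):
        self.limit = limit
        self.interval = interval
        self.window_start = time.monotonic()
        self.count = 0

    def acquire(self):
        """Block until the current window permits another request."""
        now = time.monotonic()
        if now - self.window_start >= self.interval:
            # Window expired: start a fresh one.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            # Window full: sleep until it rolls over.
            sleep_for = self.interval - (now - self.window_start)
            if sleep_for > 0:
                time.sleep(sleep_for)
            self.window_start = time.monotonic()
            self.count = 0
        self.count += 1
```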

Crawl with extra hosts: URLs containing the given include snippets will also be crawled recursively.

$ requests_crawler --seed <seed_url> --include <snippet>

Skip excluded URL snippets, e.g. URLs containing httprunner will be skipped.

$ requests_crawler --seed <seed_url> --exclude httprunner
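The include/exclude mechanism can be sketched as a substring filter over candidate URLs. This is a hedged simplification (the real crawler's rules, e.g. for extra hosts, may be more involved), and `should_crawl` is a hypothetical helper, not part of the package's API:

```python
def should_crawl(url, include=None, exclude=None):
    """Decide whether a URL should be crawled recursively.

    URLs containing any `exclude` snippet are skipped; when `include`
    snippets are given, only URLs containing one of them are crawled.
    """
    if exclude and any(snippet in url for snippet in exclude):
        return False
    if include:
        return any(snippet in url for snippet in include)
    return True
```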

Download files

Source Distribution: requests-crawler-0.5.4.tar.gz (11.3 kB)

Built Distribution: requests_crawler-0.5.4-py2.py3-none-any.whl (20.6 kB)
