Project description
requests-crawler
A web crawler based on requests-html, mainly targeting URL validation tests.
Features
- based on requests-html, with full JavaScript support!
- supports request frequency limits, e.g. RPS/RPM
- supports crawling with custom headers and cookies
- include & exclude mechanism for URLs
- groups visited URLs by HTTP status code
- displays each URL's referers and hyperlinks (see the sketch after this list)
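The core idea can be sketched in a few lines directly on top of requests-html. The `crawl()` helper and the result structures below are illustrative assumptions about how such a crawler can group URLs by status code and track referers, not requests-crawler's actual internals:

```python
# Minimal sketch of the idea on top of requests-html; the crawl() helper and
# the result structures here are illustrative assumptions, not the actual
# internals of requests-crawler.
from collections import defaultdict
from urllib.parse import urlparse

from requests_html import HTMLSession


def crawl(seed):
    session = HTMLSession()
    seed_host = urlparse(seed).netloc
    visited, queue = set(), [seed]
    status_groups = defaultdict(list)   # HTTP status code -> visited URLs
    referers = defaultdict(set)         # URL -> pages that link to it

    while queue:
        url = queue.pop()
        if url in visited:
            continue
        visited.add(url)
        resp = session.get(url)
        status_groups[resp.status_code].append(url)
        for link in resp.html.absolute_links:
            referers[link].add(url)
            if urlparse(link).netloc == seed_host:  # follow seed-host links only
                queue.append(link)

    return status_groups, referers
```

Calling `crawl("http://debugtalk.com")` in this sketch would yield one list of URLs per status code (200, 404, ...) plus each URL's referers.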
Installation/Upgrade
$ pip install requests-crawler
Only Python 3.6 is supported.
To verify that the installation or upgrade succeeded, run requests_crawler -V and check that the correct version number is printed.
$ requests_crawler -V
0.5.3
Usage
$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL] [--seed SEED]
                        [--headers [HEADERS [HEADERS ...]]]
                        [--cookies [COOKIES [COOKIES ...]]]
                        [--requests-limit REQUESTS_LIMIT]
                        [--interval-limit INTERVAL_LIMIT]
                        [--include [INCLUDE [INCLUDE ...]]]
                        [--exclude [EXCLUDE [EXCLUDE ...]]]
                        [--workers WORKERS]

A web crawler based on requests-html, mainly targets for url validation test.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version
  --log-level LOG_LEVEL
                        Specify logging level, default is INFO.
  --seed SEED           Specify crawl seed url
  --headers [HEADERS [HEADERS ...]]
                        Specify headers, e.g. 'User-Agent:iOS/10.3'
  --cookies [COOKIES [COOKIES ...]]
                        Specify cookies, e.g. 'lang=en country:us'
  --requests-limit REQUESTS_LIMIT
                        Specify requests limit for crawler, default rps.
  --interval-limit INTERVAL_LIMIT
                        Specify limit interval, default 1 second.
  --include [INCLUDE [INCLUDE ...]]
                        Urls include the snippets will be crawled recursively.
  --exclude [EXCLUDE [EXCLUDE ...]]
                        Urls include the snippets will be skipped.
  --workers WORKERS     Specify concurrent workers number.
Examples
Basic usage.
$ requests_crawler --seed http://debugtalk.com
Crawl with headers and cookies.
$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 --cookies lang:en country:us
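These options correspond to ordinary HTTP headers and cookies. A hedged sketch of the roughly equivalent call made through requests-html (illustrative only, not requests-crawler's actual code):

```python
# Illustrative only: roughly how --headers / --cookies map onto a
# requests-html call; not requests-crawler's actual code.
from requests_html import HTMLSession

session = HTMLSession()
resp = session.get(
    "http://debugtalk.com",
    headers={"User-Agent": "iOS/10.3"},
    cookies={"lang": "en", "country": "us"},
)
print(resp.status_code)
```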
Crawl with 30 rps limitation.
$ requests_crawler --seed http://debugtalk.com --requests-limit 30
Crawl with 500 rpm limitation.
$ requests_crawler --seed http://debugtalk.com --requests-limit 500 --interval-limit 60
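Here --requests-limit is the number of requests allowed per interval and --interval-limit is the interval length in seconds (1 second by default, 60 seconds for RPM). A minimal sleep-based limiter sketch, assuming a fixed-window approach rather than the package's actual mechanism:

```python
# Sketch of a fixed-window request limiter; this is an assumption about how
# such a cap could work, not necessarily requests-crawler's implementation.
import time


class RateLimiter:
    def __init__(self, requests_limit, interval_limit=1.0):
        self.requests_limit = requests_limit   # e.g. 30 (rps) or 500 (rpm)
        self.interval_limit = interval_limit   # window length in seconds
        self.window_start = time.monotonic()
        self.count = 0

    def wait(self):
        """Block until another request is allowed in the current window."""
        now = time.monotonic()
        if now - self.window_start >= self.interval_limit:
            # window expired: start a fresh one
            self.window_start, self.count = now, 0
        elif self.count >= self.requests_limit:
            # window full: sleep out the remainder, then start a fresh window
            time.sleep(self.interval_limit - (now - self.window_start))
            self.window_start, self.count = time.monotonic(), 0
        self.count += 1
```

In this sketch a crawler would call limiter.wait() before each request; RateLimiter(30) caps at 30 rps and RateLimiter(500, 60) at 500 rpm.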
Crawl with extra included hosts, e.g. URLs on httprunner.org will also be crawled recursively.
$ requests_crawler --seed http://debugtalk.com --include httprunner.org
Skip URLs containing excluded snippets, e.g. URLs containing httprunner will be skipped.
$ requests_crawler --seed http://debugtalk.com --exclude httprunner
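Both options are plain substring filters on candidate URLs, as the help text above describes. A small sketch of that filtering rule (the should_crawl() helper is a hypothetical name for illustration, not part of requests-crawler's API):

```python
# Sketch of the include/exclude rule, assuming plain substring matching as
# described in the --include / --exclude help text; should_crawl() is a
# hypothetical helper, not part of requests-crawler's API.
from urllib.parse import urlparse


def should_crawl(url, seed_host, include=(), exclude=()):
    if any(snippet in url for snippet in exclude):
        return False                                   # excluded snippet -> skip
    if urlparse(url).netloc == seed_host:
        return True                                    # seed host is always crawled
    return any(snippet in url for snippet in include)  # extra included hosts


# mirroring `--include httprunner.org`:
should_crawl("https://httprunner.org/docs", "debugtalk.com",
             include=["httprunner.org"])   # -> True
# mirroring `--exclude httprunner`:
should_crawl("http://debugtalk.com/tags/httprunner/", "debugtalk.com",
             exclude=["httprunner"])       # -> False
```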
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
requests-crawler-0.5.4.tar.gz (11.3 kB)
Built Distribution
requests_crawler-0.5.4-py2.py3-none-any.whl
Hashes for requests_crawler-0.5.4-py2.py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | e2b557e8fdfa5e66ccbea8ce7baf7424cc6d4b3582375963b1fee1b23e47a297
MD5 | e9fad5898a1c350dd7730443f3feeb1e
BLAKE2-256 | 7db7c73ab226be33a1e1788f8b5165b700f7557266354f6db7619494c9267e6a