requests-crawler
A web crawler based on requests-html, mainly targeted at URL validation testing.
Features
- based on requests-html, with full JavaScript support
- supports request frequency limits, e.g. RPS/RPM
- supports crawling with custom headers and cookies
- include & exclude URL filters
- groups visited URLs by HTTP status code
- displays each URL's referers and hyperlinks
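At its core, a crawler like this fetches a page, extracts its hyperlinks, and queues the unvisited ones. Below is a minimal stdlib sketch of the link-extraction step only; the real tool uses requests-html, which can additionally render JavaScript before links are extracted. The class name and structure here are illustrative, not the tool's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href> tags on a single page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.add(urljoin(self.base_url, value))


page = '<a href="/archives">Archives</a> <a href="http://httprunner.org">HttpRunner</a>'
parser = LinkExtractor("http://debugtalk.com")
parser.feed(page)
print(sorted(parser.links))
# → ['http://debugtalk.com/archives', 'http://httprunner.org']
```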
Installation/Upgrade
$ pip install requests-crawler
Only Python 3.6 is supported.
To verify that the installation or upgrade succeeded, run requests_crawler -V
and check that the expected version number is printed.
$ requests_crawler -V
0.5.3
Usage
$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL]
[--seed SEED]
[--headers [HEADERS [HEADERS ...]]]
[--cookies [COOKIES [COOKIES ...]]]
[--requests-limit REQUESTS_LIMIT]
[--interval-limit INTERVAL_LIMIT]
[--include [INCLUDE [INCLUDE ...]]]
[--exclude [EXCLUDE [EXCLUDE ...]]]
[--workers WORKERS]
A web crawler based on requests-html, mainly targets for url validation test.
optional arguments:
-h, --help show this help message and exit
-V, --version show version
--log-level LOG_LEVEL
Specify logging level, default is INFO.
--seed SEED Specify crawl seed url
--headers [HEADERS [HEADERS ...]]
Specify headers, e.g. 'User-Agent:iOS/10.3'
--cookies [COOKIES [COOKIES ...]]
Specify cookies, e.g. 'lang=en country:us'
--requests-limit REQUESTS_LIMIT
Specify requests limit for crawler, default rps.
--interval-limit INTERVAL_LIMIT
Specify limit interval, default 1 second.
--include [INCLUDE [INCLUDE ...]]
Urls include the snippets will be crawled recursively.
--exclude [EXCLUDE [EXCLUDE ...]]
Urls include the snippets will be skipped.
--workers WORKERS Specify concurrent workers number.
Examples
Basic usage.
$ requests_crawler --seed http://debugtalk.com
Crawl with headers and cookies.
$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 --cookies lang:en country:us
Crawl with 30 rps limitation.
$ requests_crawler --seed http://debugtalk.com --requests-limit 30
Crawl with 500 rpm limitation.
$ requests_crawler --seed http://debugtalk.com --requests-limit 500 --interval-limit 60
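The two flags combine into a fixed-window limit: at most --requests-limit requests per --interval-limit seconds, so 30 RPS is limit 30 over the default 1-second interval, and 500 RPM is limit 500 over a 60-second interval. A sketch of that idea (class name and internals are assumptions for illustration, not the tool's actual implementation):

```python
import time


class RateLimiter:
    """Allow at most `limit` requests per `interval` seconds (fixed window)."""

    def __init__(self, limit, interval=1.0, clock=time.monotonic):
        self.limit = limit
        self.interval = interval
        self.clock = clock          # injectable clock, handy for testing
        self.window_start = clock()
        self.count = 0

    def wait(self):
        """Block until the next request is allowed, then record it."""
        now = self.clock()
        if now - self.window_start >= self.interval:
            # The previous window expired; start a fresh one.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            # Window is full: sleep out the remainder, then reset.
            time.sleep(self.window_start + self.interval - now)
            self.window_start = self.clock()
            self.count = 0
        self.count += 1


# 500 RPM == at most 500 requests per 60-second window.
limiter = RateLimiter(limit=500, interval=60)
```

Each worker would call limiter.wait() before issuing a request.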
Crawl with extra hosts; e.g. URLs under httprunner.org will also be crawled recursively.
$ requests_crawler --seed http://debugtalk.com --include httprunner.org
Skip URLs containing excluded snippets; e.g. URLs containing httprunner will be skipped.
$ requests_crawler --seed http://debugtalk.com --exclude httprunner
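Include/exclude matching is plain substring filtering on the URL. A sketch of one plausible rule consistent with the examples above (the function name, and the choice that exclude takes precedence over include, are assumptions about the tool's behavior):

```python
def should_crawl(url, seed_host, includes=(), excludes=()):
    """Decide whether `url` should be crawled recursively.

    A URL qualifies if it contains the seed host or any `includes`
    snippet; it is rejected if it contains any `excludes` snippet.
    """
    if any(snippet in url for snippet in excludes):
        return False
    return seed_host in url or any(snippet in url for snippet in includes)


print(should_crawl("http://debugtalk.com/archives", "debugtalk.com"))
# → True
print(should_crawl("http://httprunner.org/docs", "debugtalk.com",
                   includes=["httprunner.org"]))
# → True  (extra host pulled in via --include)
print(should_crawl("http://debugtalk.com/tags/httprunner", "debugtalk.com",
                   excludes=["httprunner"]))
# → False (snippet match from --exclude wins)
```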
Download files
Source Distribution

requests-crawler-0.5.4.tar.gz (11.3 kB)

Built Distribution

Hashes for requests_crawler-0.5.4-py2.py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | e2b557e8fdfa5e66ccbea8ce7baf7424cc6d4b3582375963b1fee1b23e47a297
MD5 | e9fad5898a1c350dd7730443f3feeb1e
BLAKE2b-256 | 7db7c73ab226be33a1e1788f8b5165b700f7557266354f6db7619494c9267e6a