requests-crawler
A web crawler based on requests-html, mainly targeted at URL validation testing.
Features
- based on requests-html, with full JavaScript support
- supports request frequency limits, e.g. RPS/RPM
- supports crawling with custom headers and cookies
- include & exclude mechanism for URL filtering
- groups visited URLs by HTTP status code
- displays each URL's referers and hyperlinks (see the sketch after this list)
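
As a rough illustration of the status-grouping and referer-reporting features, the sketch below shows one way crawled URLs could be bucketed by status code and traced back to the pages that linked to them. This is only an illustrative sketch; the record/report helpers are hypothetical and not requests-crawler's actual internals.

# Illustrative sketch only -- not requests-crawler's actual internals.
# Groups visited URLs by HTTP status code and records each URL's referers
# so a final report can show where every broken link was found.
from collections import defaultdict

status_groups = defaultdict(set)   # status code -> set of visited URLs
referers = defaultdict(set)        # URL -> set of pages that linked to it

def record(url, status_code, referer=None):
    # Hypothetical helper: store one crawl result for later reporting.
    status_groups[status_code].add(url)
    if referer:
        referers[url].add(referer)

def report():
    # Print URLs grouped by status code, each with its referers.
    for status in sorted(status_groups):
        print(f"HTTP {status}: {len(status_groups[status])} url(s)")
        for url in sorted(status_groups[status]):
            print(f"  {url}  referers: {sorted(referers[url])}")

record("http://debugtalk.com/missing", 404, referer="http://debugtalk.com/")
report()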
Installation/Upgrade
$ pip install -U git+https://github.com/debugtalk/WebCrawler.git
Only Python 3.6 is supported.
To verify that the installation or upgrade succeeded, run requests_crawler -V
and check that the correct version number is printed.
$ requests_crawler -V
0.5.2
Usage
$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL] [--seed SEED]
                        [--headers [HEADERS [HEADERS ...]]]
                        [--cookies [COOKIES [COOKIES ...]]]
                        [--requests-limit REQUESTS_LIMIT]
                        [--interval-limit INTERVAL_LIMIT]
                        [--include [INCLUDE [INCLUDE ...]]]
                        [--exclude [EXCLUDE [EXCLUDE ...]]]
                        [--workers WORKERS]

A web crawler based on requests-html, mainly targets for url validation test.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version
  --log-level LOG_LEVEL
                        Specify logging level, default is INFO.
  --seed SEED           Specify crawl seed url
  --headers [HEADERS [HEADERS ...]]
                        Specify headers, e.g. 'User-Agent:iOS/10.3'
  --cookies [COOKIES [COOKIES ...]]
                        Specify cookies, e.g. 'lang=en country:us'
  --requests-limit REQUESTS_LIMIT
                        Specify requests limit for crawler, default rps.
  --interval-limit INTERVAL_LIMIT
                        Specify limit interval, default 1 second.
  --include [INCLUDE [INCLUDE ...]]
                        Urls include the snippets will be crawled recursively.
  --exclude [EXCLUDE [EXCLUDE ...]]
                        Urls include the snippets will be skipped.
  --workers WORKERS     Specify concurrent workers number.
Examples
Basic usage.
$ requests_crawler --seed http://debugtalk.com
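
Conceptually, the crawl fetches each page with requests-html and follows the links it finds, starting from the seed. The following minimal sketch shows that idea using only the public requests-html API (HTMLSession, absolute_links); it is not the package's own implementation, and the crawl/max_pages names are illustrative.

# Minimal breadth-first link crawl with requests-html -- an illustrative
# sketch of a seed-based URL validation crawl, not the actual
# requests-crawler implementation.
from collections import deque
from urllib.parse import urlparse

from requests_html import HTMLSession

def crawl(seed, max_pages=50):
    session = HTMLSession()
    host = urlparse(seed).netloc
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        resp = session.get(url)
        print(resp.status_code, url)
        # Only follow links on the seed's host; absolute_links is a
        # requests-html property returning a set of absolute URLs.
        for link in resp.html.absolute_links:
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)

if __name__ == "__main__":
    crawl("http://debugtalk.com")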
Crawl with headers and cookies.
$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 --cookies lang:en country:us
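
Header and cookie arguments are passed as 'Key:Value' strings. Assuming they are simply split on the first colon, they could be turned into the dicts that requests-html (built on requests) accepts, roughly as below; parse_pairs is a hypothetical helper, not the package's actual parser.

# Illustrative parsing of 'Key:Value' style CLI arguments into dicts --
# an assumption about the format, not requests-crawler's actual parser.
from requests_html import HTMLSession

def parse_pairs(pairs):
    # Hypothetical helper: ['User-Agent:iOS/10.3', 'lang:en'] -> dict
    return dict(pair.split(":", 1) for pair in pairs)

headers = parse_pairs(["User-Agent:iOS/10.3"])
cookies = parse_pairs(["lang:en", "country:us"])

session = HTMLSession()
resp = session.get("http://debugtalk.com", headers=headers, cookies=cookies)
print(resp.status_code)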
Crawl with a limit of 30 requests per second (rps).
$ requests_crawler --seed http://debugtalk.com --requests-limit 30
Crawl with a limit of 500 requests per minute (rpm).
$ requests_crawler --seed http://debugtalk.com --requests-limit 500 --interval-limit 60
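
--requests-limit N together with --interval-limit S caps the crawl at roughly N requests per S seconds, so 30 with the default 1-second interval means about 30 rps, and 500 with a 60-second interval means about 500 rpm. As a hedged sketch (not the package's actual mechanism), a simple fixed-window limiter enforcing such a cap could look like this:

# Illustrative fixed-window rate limiter: at most `limit` requests per
# `interval` seconds. Not requests-crawler's actual implementation.
import time

class RateLimiter:
    def __init__(self, limit, interval=1.0):
        self.limit = limit
        self.interval = interval
        self.window_start = time.monotonic()
        self.count = 0

    def acquire(self):
        # Call before each request; blocks when the current window is full.
        now = time.monotonic()
        if now - self.window_start >= self.interval:
            self.window_start, self.count = now, 0
        if self.count >= self.limit:
            time.sleep(self.interval - (now - self.window_start))
            self.window_start, self.count = time.monotonic(), 0
        self.count += 1

# e.g. 30 requests per second, or RateLimiter(500, 60.0) for 500 rpm:
limiter = RateLimiter(limit=30, interval=1.0)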
Crawl with extra included hosts; for example, URLs containing httprunner.org
will also be crawled recursively.
$ requests_crawler --seed http://debugtalk.com --include httprunner.org
Skip excluded URL snippets; for example, URLs containing httprunner
will be skipped.
$ requests_crawler --seed http://debugtalk.com --exclude httprunner
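
Judging from the help text, include and exclude work by substring matching: a URL is skipped if it contains any exclude snippet, and followed if it contains the seed host or any include snippet. The sketch below encodes that assumed rule; should_crawl is a hypothetical helper, not requests-crawler's actual logic.

# Illustrative substring-based include/exclude filter -- an assumption
# drawn from the CLI help text, not requests-crawler's actual logic.
def should_crawl(url, seed_host, include=(), exclude=()):
    if any(snippet in url for snippet in exclude):
        return False                      # e.g. --exclude httprunner
    allowed = (seed_host,) + tuple(include)
    return any(snippet in url for snippet in allowed)

print(should_crawl("http://debugtalk.com/about", "debugtalk.com"))                                # True
print(should_crawl("https://httprunner.org/docs", "debugtalk.com", include=("httprunner.org",)))  # True
print(should_crawl("https://httprunner.org/docs", "debugtalk.com", exclude=("httprunner",)))      # False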