requests-crawler
A web crawler based on requests-html, mainly targeted at URL validation testing.
Features
- based on requests-html, full JavaScript support!
- supports request frequency limits, e.g. rps/rpm
- supports crawling with custom headers and cookies
- include & exclude mechanism
- groups visited URLs by HTTP status code
- displays each URL's referers and hyperlinks
Installation/Upgrade
$ pip install -U git+https://github.com/debugtalk/WebCrawler.git
Only Python 3.6 is supported.
To verify that the installation or upgrade succeeded, run requests_crawler -V
and check that the correct version number is printed.
$ requests_crawler -V
0.5.2
Usage
$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL]
[--seed SEED]
[--headers [HEADERS [HEADERS ...]]]
[--cookies [COOKIES [COOKIES ...]]]
[--requests-limit REQUESTS_LIMIT]
[--interval-limit INTERVAL_LIMIT]
[--include [INCLUDE [INCLUDE ...]]]
[--exclude [EXCLUDE [EXCLUDE ...]]]
[--workers WORKERS]
A web crawler based on requests-html, mainly targets for url validation test.
optional arguments:
-h, --help show this help message and exit
-V, --version show version
--log-level LOG_LEVEL
Specify logging level, default is INFO.
--seed SEED Specify crawl seed url
--headers [HEADERS [HEADERS ...]]
Specify headers, e.g. 'User-Agent:iOS/10.3'
--cookies [COOKIES [COOKIES ...]]
Specify cookies, e.g. 'lang=en country:us'
--requests-limit REQUESTS_LIMIT
Specify requests limit for crawler, default rps.
--interval-limit INTERVAL_LIMIT
Specify limit interval, default 1 second.
--include [INCLUDE [INCLUDE ...]]
Urls include the snippets will be crawled recursively.
--exclude [EXCLUDE [EXCLUDE ...]]
Urls include the snippets will be skipped.
--workers WORKERS Specify concurrent workers number.
Examples
Basic usage.
$ requests_crawler --seed http://debugtalk.com
Crawl with headers and cookies.
$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 --cookies lang:en country:us
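Headers and cookies are passed as space-separated `key:value` strings. A minimal sketch of how such arguments can be turned into dicts (an illustrative helper, not the crawler's actual parser):

```python
def parse_pairs(items, sep=":"):
    """Split 'key:value' strings into a dict.

    Illustrative only; the real requests_crawler parsing may differ.
    """
    pairs = {}
    for item in items:
        # partition keeps everything after the first separator,
        # so values like 'iOS/10.3' survive intact
        key, _, value = item.partition(sep)
        pairs[key] = value
    return pairs


headers = parse_pairs(["User-Agent:iOS/10.3"])
cookies = parse_pairs(["lang:en", "country:us"])
```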
Crawl with a 30 rps limit.
$ requests_crawler --seed http://debugtalk.com --requests-limit 30
Crawl with a 500 rpm limit.
$ requests_crawler --seed http://debugtalk.com --requests-limit 500 --interval-limit 60
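The two flags combine into one rate: `--requests-limit` requests are allowed per `--interval-limit` seconds (default interval 1 second, i.e. rps). A sketch of the arithmetic, not the crawler's internals:

```python
def effective_rps(requests_limit, interval_limit=1):
    """Effective requests-per-second implied by the two CLI flags.

    Sketch of the arithmetic only; interval_limit defaults to 1 second
    as the help text describes.
    """
    return requests_limit / interval_limit


effective_rps(30)       # --requests-limit 30            -> 30 rps
effective_rps(500, 60)  # --requests-limit 500, 60s span -> ~8.33 rps
```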
Crawl with extra hosts, e.g. URLs under httprunner.org will also be crawled recursively.
$ requests_crawler --seed http://debugtalk.com --include httprunner.org
Skip excluded URL snippets, e.g. URLs containing httprunner will be skipped.
$ requests_crawler --seed http://debugtalk.com --exclude httprunner
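Read together, the flag descriptions suggest a simple substring rule for deciding whether a discovered link is crawled recursively. A hypothetical version of that rule (the function name and exact precedence are assumptions, not the crawler's code):

```python
def should_crawl(url, seed_host, include=(), exclude=()):
    """Hypothetical filter matching the --include/--exclude help text:
    exclude snippets always win; otherwise crawl URLs on the seed host
    or containing any --include snippet."""
    if any(snippet in url for snippet in exclude):
        return False
    return seed_host in url or any(snippet in url for snippet in include)
```

For example, with seed host debugtalk.com, `--include httprunner.org` admits links to that extra host, while `--exclude httprunner` would skip any URL containing that snippet even on the seed host.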