Project description
requests-crawler
A web crawler based on requests-html, aimed mainly at URL validation testing.
Features
- based on requests-html, with full JavaScript support
- supports request frequency limits, e.g. RPS/RPM
- supports crawling with custom headers and cookies
- include & exclude mechanisms (sketched below)
- groups visited URLs by HTTP status code
- displays each URL's referers and hyperlinks
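The include/exclude filtering and status-code grouping amount to a breadth-first crawl over the links that requests-html extracts from each page. The sketch below illustrates that general technique only; it is not requests-crawler's actual code, and the function and parameter names are hypothetical.

```python
# Sketch of a breadth-first crawl with include/exclude filtering, grouping
# visited urls by HTTP status code. Illustration only, not requests-crawler's
# internals; the simplified include semantics here are an assumption.
from collections import defaultdict, deque
from requests_html import HTMLSession

def crawl(seed, include=(), exclude=(), max_pages=100):
    session = HTMLSession()
    seen, queue = {seed}, deque([seed])
    status_groups = defaultdict(list)   # HTTP status code -> list of visited urls
    visited = 0

    while queue and visited < max_pages:
        url = queue.popleft()
        resp = session.get(url)
        visited += 1
        status_groups[resp.status_code].append(url)

        for link in resp.html.absolute_links:
            if link in seen:
                continue
            # Skip links that match an exclude snippet.
            if exclude and any(snippet in link for snippet in exclude):
                continue
            # Only follow links that match an include snippet (if any are given).
            if include and not any(snippet in link for snippet in include):
                continue
            seen.add(link)
            queue.append(link)

    return status_groups
```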
Installation/Upgrade
$ pip install requests-crawler
Only Python 3.6 is supported.
To verify that the installation or upgrade succeeded, run requests_crawler -V
and check that it prints the expected version number.
$ requests_crawler -V
0.5.3
Usage
$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL]
[--seed SEED]
[--headers [HEADERS [HEADERS ...]]]
[--cookies [COOKIES [COOKIES ...]]]
[--requests-limit REQUESTS_LIMIT]
[--interval-limit INTERVAL_LIMIT]
[--include [INCLUDE [INCLUDE ...]]]
[--exclude [EXCLUDE [EXCLUDE ...]]]
[--workers WORKERS]
A web crawler based on requests-html, mainly targets for url validation test.
optional arguments:
-h, --help show this help message and exit
-V, --version show version
--log-level LOG_LEVEL
Specify logging level, default is INFO.
--seed SEED Specify crawl seed url
--headers [HEADERS [HEADERS ...]]
Specify headers, e.g. 'User-Agent:iOS/10.3'
--cookies [COOKIES [COOKIES ...]]
Specify cookies, e.g. 'lang=en country:us'
--requests-limit REQUESTS_LIMIT
Specify requests limit for crawler, default rps.
--interval-limit INTERVAL_LIMIT
Specify limit interval, default 1 second.
--include [INCLUDE [INCLUDE ...]]
Urls include the snippets will be crawled recursively.
--exclude [EXCLUDE [EXCLUDE ...]]
Urls include the snippets will be skipped.
--workers WORKERS Specify concurrent workers number.
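The --requests-limit and --interval-limit options together define a request budget per time window (e.g. 30 requests per 1 second, or 500 requests per 60 seconds). A minimal sleep-based throttle along those lines is sketched below; it only illustrates the rps/rpm semantics and is not the package's actual implementation.

```python
import time

class Throttle:
    """Allow at most requests_limit requests per interval_limit seconds.

    A simple illustration of rps/rpm limiting, not requests-crawler's code.
    """

    def __init__(self, requests_limit=30, interval_limit=1):
        self.requests_limit = requests_limit
        self.interval_limit = interval_limit
        self.window_start = time.monotonic()
        self.count = 0

    def wait(self):
        now = time.monotonic()
        if now - self.window_start >= self.interval_limit:
            # New time window: reset the counter.
            self.window_start, self.count = now, 0
        if self.count >= self.requests_limit:
            # Budget exhausted: sleep until the current window ends.
            time.sleep(self.interval_limit - (now - self.window_start))
            self.window_start, self.count = time.monotonic(), 0
        self.count += 1

# Example: throttle = Throttle(requests_limit=500, interval_limit=60)  # 500 rpm
# Call throttle.wait() before each request.
```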
Examples
Basic usage.
$ requests_crawler --seed http://debugtalk.com
Crawl with headers and cookies.
$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 --cookies lang:en country:us
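Headers and cookies are passed as space-separated key:value strings. How the CLI parses them internally is not documented here, but a plausible illustration of turning such strings into the dicts that requests-html accepts (an assumption, not the package's code) is:

```python
# Hypothetical parsing of 'key:value' pairs into dicts; splitting on the first
# ':' is an assumption about the format, not requests-crawler's implementation.
def parse_pairs(pairs):
    result = {}
    for pair in pairs:
        key, _, value = pair.partition(":")
        result[key] = value
    return result

headers = parse_pairs(["User-Agent:iOS/10.3"])    # {'User-Agent': 'iOS/10.3'}
cookies = parse_pairs(["lang:en", "country:us"])  # {'lang': 'en', 'country': 'us'}
```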
Crawl with a 30 RPS limit.
$ requests_crawler --seed http://debugtalk.com --requests-limit 30
Crawl with a 500 RPM limit.
$ requests_crawler --seed http://debugtalk.com --requests-limit 500 --interval-limit 60
Crawl with extra included hosts; URLs containing httprunner.org, for example, will also be crawled recursively.
$ requests_crawler --seed http://debugtalk.com --include httprunner.org
Skip URLs matching excluded snippets; URLs containing httprunner, for example, will be skipped.
$ requests_crawler --seed http://debugtalk.com --exclude httprunner
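Because the crawler groups visited URLs by status code, it fits naturally into a CI-style URL validation check. The wrapper below is a hypothetical example that shells out to the CLI; treating a non-zero exit code as failure is an assumption, not documented behaviour.

```python
import subprocess
import sys

def validate_site(seed, exclude=()):
    """Run requests_crawler against seed and surface its output.

    Assumes (hypothetically) that a non-zero exit code indicates a failed crawl.
    """
    cmd = ["requests_crawler", "--seed", seed]
    for snippet in exclude:
        cmd += ["--exclude", snippet]
    # stdout/stderr capture compatible with Python 3.6.
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    print(result.stdout)
    return result.returncode == 0

if __name__ == "__main__":
    ok = validate_site("http://debugtalk.com", exclude=["httprunner"])
    sys.exit(0 if ok else 1)
```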
Project details
Release history
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file requests-crawler-0.5.4.tar.gz.
File metadata
- Download URL: requests-crawler-0.5.4.tar.gz
- Upload date:
- Size: 11.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | fa4e14bc0c2d203765747266b8836e487aa777d86dc0bd34972c7f8cdae098e3
MD5 | f88b3224bc2d36c9d782c5cbe0896e23
BLAKE2b-256 | 612ace5aa0db4a6d81e27d8676be7734399cf986fda0401f12b220a9fba63785
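To confirm that a downloaded archive matches the published digests, you can recompute the SHA256 locally, for example:

```python
import hashlib

# Expected digest taken from the table above.
EXPECTED_SHA256 = "fa4e14bc0c2d203765747266b8836e487aa777d86dc0bd34972c7f8cdae098e3"

with open("requests-crawler-0.5.4.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == EXPECTED_SHA256, "hash mismatch: " + digest
print("SHA256 verified")
```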
File details
Details for the file requests_crawler-0.5.4-py2.py3-none-any.whl.
File metadata
- Download URL: requests_crawler-0.5.4-py2.py3-none-any.whl
- Upload date:
- Size: 20.6 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | e2b557e8fdfa5e66ccbea8ce7baf7424cc6d4b3582375963b1fee1b23e47a297
MD5 | e9fad5898a1c350dd7730443f3feeb1e
BLAKE2b-256 | 7db7c73ab226be33a1e1788f8b5165b700f7557266354f6db7619494c9267e6a