Project description

requests-crawler

A web crawler based on requests-html, mainly targeted at URL validation testing.

Features

  • based on requests-html, with full JavaScript support (see the sketch after this list)
  • supports request frequency limits, e.g. RPS/RPM
  • supports crawling with custom headers and cookies
  • include & exclude mechanism for filtering crawled URLs
  • groups visited URLs by HTTP status code
  • displays each URL's referers and hyperlinks
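
For orientation, here is a minimal sketch of the underlying idea, using requests-html directly. It only illustrates how such a crawl can work; it is not requests-crawler's actual implementation, and the breadth-first, same-host-only logic here is an assumption.

from urllib.parse import urlparse
from requests_html import HTMLSession

session = HTMLSession()

def crawl(seed):
    """Breadth-first crawl restricted to the seed's host; returns {url: status_code}."""
    host = urlparse(seed).netloc
    seen, queue, statuses = set(), [seed], {}
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        resp = session.get(url)
        statuses[url] = resp.status_code        # can be grouped by status code afterwards
        for link in resp.html.absolute_links:   # hyperlinks parsed by requests-html
            if urlparse(link).netloc == host and link not in seen:
                queue.append(link)
    return statuses

# e.g. statuses = crawl("http://debugtalk.com")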

Installation/Upgrade

$ pip install requests-crawler

Only Python 3.6 is supported.

To verify that the installation or upgrade succeeded, run requests_crawler -V and check that the correct version number is printed.

$ requests_crawler -V
0.5.3

Usage

$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL]
                        [--seed SEED]
                        [--headers [HEADERS [HEADERS ...]]]
                        [--cookies [COOKIES [COOKIES ...]]]
                        [--requests-limit REQUESTS_LIMIT]
                        [--interval-limit INTERVAL_LIMIT]
                        [--include [INCLUDE [INCLUDE ...]]]
                        [--exclude [EXCLUDE [EXCLUDE ...]]]
                        [--workers WORKERS]

A web crawler based on requests-html, mainly targets for url validation test.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version
  --log-level LOG_LEVEL
                        Specify logging level, default is INFO.
  --seed SEED           Specify crawl seed url
  --headers [HEADERS [HEADERS ...]]
                        Specify headers, e.g. 'User-Agent:iOS/10.3'
  --cookies [COOKIES [COOKIES ...]]
                        Specify cookies, e.g. 'lang=en country:us'
  --requests-limit REQUESTS_LIMIT
                        Specify requests limit for crawler, default rps.
  --interval-limit INTERVAL_LIMIT
                        Specify limit interval, default 1 second.
  --include [INCLUDE [INCLUDE ...]]
                        Urls include the snippets will be crawled recursively.
  --exclude [EXCLUDE [EXCLUDE ...]]
                        Urls include the snippets will be skipped.
  --workers WORKERS     Specify concurrent workers number.

Examples

Basic usage.

$ requests_crawler --seed http://debugtalk.com

Crawl with headers and cookies.

$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 --cookies lang:en country:us
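
Under the hood, requests-html is built on requests, so headers and cookies of this form map naturally onto a session call. A minimal sketch (exactly how requests-crawler parses these arguments is an assumption):

from requests_html import HTMLSession

session = HTMLSession()
resp = session.get(
    "http://debugtalk.com",
    headers={"User-Agent": "iOS/10.3"},        # from --headers User-Agent:iOS/10.3
    cookies={"lang": "en", "country": "us"},   # from --cookies lang:en country:us
)
print(resp.status_code)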

Crawl with a limit of 30 requests per second (RPS).

$ requests_crawler --seed http://debugtalk.com --requests-limit 30

Crawl with a limit of 500 requests per minute (RPM), i.e. 500 requests per 60-second interval.

$ requests_crawler --seed http://debugtalk.com --requests-limit 500 --interval-limit 60
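
The two options combine into "at most N requests per interval". As a conceptual illustration only (not the crawler's internal code, and the class name is hypothetical), a simple interval-based throttle could look like this:

import time

class IntervalThrottle:
    """Allow at most `limit` requests per `interval` seconds."""
    def __init__(self, limit, interval=1.0):
        self.limit = limit
        self.interval = interval
        self.window_start = time.monotonic()
        self.count = 0

    def wait(self):
        now = time.monotonic()
        if now - self.window_start >= self.interval:
            # previous window expired: start a new one
            self.window_start, self.count = now, 0
        if self.count >= self.limit:
            # window is full: sleep until it expires, then reset
            time.sleep(self.interval - (now - self.window_start))
            self.window_start, self.count = time.monotonic(), 0
        self.count += 1

# --requests-limit 500 --interval-limit 60  ~  IntervalThrottle(limit=500, interval=60)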

Crawl with extra included hosts, e.g. httprunner.org will also be crawled recursively.

$ requests_crawler --seed http://debugtalk.com --include httprunner.org

Skip URLs containing excluded snippets, e.g. URLs containing httprunner will be skipped.

$ requests_crawler --seed http://debugtalk.com --exclude httprunner
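
Both options appear to work on URL substrings: a URL containing any --exclude snippet is skipped, while --include snippets extend the crawl beyond the seed's host. A minimal sketch of that filtering logic (the exact semantics and the helper name are assumptions):

def should_crawl(url, seed_host, include=(), exclude=()):
    """Decide whether a discovered URL should be crawled recursively."""
    if any(snippet in url for snippet in exclude):
        return False                                    # e.g. --exclude httprunner
    if seed_host in url:
        return True                                     # always follow the seed's own host
    return any(snippet in url for snippet in include)   # e.g. --include httprunner.org

# should_crawl("https://httprunner.org/docs", "debugtalk.com", include=["httprunner.org"])  -> True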

Download files

Download the file for your platform.

Source Distribution

requests-crawler-0.5.4.tar.gz (11.3 kB)

Built Distribution

requests_crawler-0.5.4-py2.py3-none-any.whl (20.6 kB, Python 2 / Python 3)

File details

Details for the file requests-crawler-0.5.4.tar.gz.

Hashes for requests-crawler-0.5.4.tar.gz

Algorithm    Hash digest
SHA256       fa4e14bc0c2d203765747266b8836e487aa777d86dc0bd34972c7f8cdae098e3
MD5          f88b3224bc2d36c9d782c5cbe0896e23
BLAKE2b-256  612ace5aa0db4a6d81e27d8676be7734399cf986fda0401f12b220a9fba63785

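To check a downloaded archive against the SHA256 digest listed above, a quick standard-library sketch (the local file path is an assumption):

import hashlib

path = "requests-crawler-0.5.4.tar.gz"  # hypothetical local download location
expected = "fa4e14bc0c2d203765747266b8836e487aa777d86dc0bd34972c7f8cdae098e3"

with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == expected else "MISMATCH")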

File details

Details for the file requests_crawler-0.5.4-py2.py3-none-any.whl.

Hashes for requests_crawler-0.5.4-py2.py3-none-any.whl

Algorithm    Hash digest
SHA256       e2b557e8fdfa5e66ccbea8ce7baf7424cc6d4b3582375963b1fee1b23e47a297
MD5          e9fad5898a1c350dd7730443f3feeb1e
BLAKE2b-256  7db7c73ab226be33a1e1788f8b5165b700f7557266354f6db7619494c9267e6a

