
requests-crawler

A web crawler based on requests-html, mainly targeted at URL validation testing.

Features

  • based on requests-html, with full JavaScript support
  • supports request frequency limits, e.g. rps/rpm
  • supports crawling with custom headers and cookies
  • include & exclude mechanisms for URL filtering
  • groups visited URLs by HTTP status code
  • displays each URL's referers and hyperlinks

Installation/Upgrade

$ pip install -U git+https://github.com/debugtalk/WebCrawler.git
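The package is also published on PyPI as requests-crawler, so installing the latest release from there should work as well (assuming the PyPI release matches the repository):

$ pip install -U requests-crawler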

Only Python 3.6 is supported.

To verify that the installation or upgrade succeeded, run requests_crawler -V and check that the correct version number is reported.

$ requests_crawler -V
0.5.2

Usage

$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL]
                        [--seed SEED]
                        [--headers [HEADERS [HEADERS ...]]]
                        [--cookies [COOKIES [COOKIES ...]]]
                        [--requests-limit REQUESTS_LIMIT]
                        [--interval-limit INTERVAL_LIMIT]
                        [--include [INCLUDE [INCLUDE ...]]]
                        [--exclude [EXCLUDE [EXCLUDE ...]]]
                        [--workers WORKERS]

A web crawler based on requests-html, mainly targets for url validation test.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version
  --log-level LOG_LEVEL
                        Specify logging level, default is INFO.
  --seed SEED           Specify crawl seed url
  --headers [HEADERS [HEADERS ...]]
                        Specify headers, e.g. 'User-Agent:iOS/10.3'
  --cookies [COOKIES [COOKIES ...]]
                        Specify cookies, e.g. 'lang=en country:us'
  --requests-limit REQUESTS_LIMIT
                        Specify requests limit for crawler, default rps.
  --interval-limit INTERVAL_LIMIT
                        Specify limit interval, default 1 second.
  --include [INCLUDE [INCLUDE ...]]
                        Urls include the snippets will be crawled recursively.
  --exclude [EXCLUDE [EXCLUDE ...]]
                        Urls include the snippets will be skipped.
  --workers WORKERS     Specify concurrent workers number.

Examples

Basic usage.

$ requests_crawler --seed http://debugtalk.com

Crawl with headers and cookies.

$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 --cookies lang:en country:us
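
Per the usage output above, both --headers and --cookies accept multiple space-separated values, so several of each can presumably be passed in one call (the extra header and cookie values here are illustrative):

$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 Referer:http://debugtalk.com --cookies lang:en country:us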

Crawl with a 30 rps limit.

$ requests_crawler --seed http://debugtalk.com --requests-limit 30

Crawl with a 500 rpm limit, i.e. at most 500 requests per 60-second interval.

$ requests_crawler --seed http://debugtalk.com --requests-limit 500 --interval-limit 60

Crawl extra hosts as well, e.g. URLs containing httprunner.org will also be crawled recursively.

$ requests_crawler --seed http://debugtalk.com --include httprunner.org

Skip URLs matching excluded snippets, e.g. URLs containing httprunner will be skipped.

$ requests_crawler --seed http://debugtalk.com --exclude httprunner
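
The remaining options from the usage output can be combined with any of the above, for example to raise log verbosity and the number of concurrent workers (the level name and worker count here are illustrative, assuming standard Python logging level names):

$ requests_crawler --seed http://debugtalk.com --log-level DEBUG --workers 10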
