
Python Testing Crawler :snake: :stethoscope: :spider:


A crawler for automated functional testing of a web application

Crawling a server-side-rendered web application is a low-cost way to get low-quality test coverage of your JavaScript-light web application.

If you have only partial test coverage of your routes, but still want to protect against silly mistakes, then this is for you.

Features:

  • Selectively spider pages and resources, or just request them
  • Submit forms, and control what values to send
  • Ignore links by source using CSS selectors
  • Fail fast or collect many errors
  • Configurable using straightforward rules

Works with the test clients for Flask (including Flask-WebTest), Django and WebTest.

Why should I use this?

Here's an example: Flaskr, the Flask tutorial application, has 166 lines of test code to achieve 100% test coverage.

Using Python Testing Crawler in a similar way to the Usage example below, we can hit 73% coverage with very little effort. Disclaimer: of course, it's not the same quality or utility of testing! But it is better than no tests, a complement to hand-written unit or functional tests, and a useful stopgap.

Installation

$ pip install python-testing-crawler

Usage

Create a crawler using your framework's existing test client, tell it where to start and what rules to obey, then set it off:

from python_testing_crawler import Crawler
from python_testing_crawler import Rule, Request, Ignore, Allow

def test_crawl_all():
    client = ...  # your framework's existing test client
    # ... any setup ...
    crawler = Crawler(
        client=client,
        initial_paths=['/'],
        rules=[
            Rule("a", '/.*', "GET", Request()),
        ]
    )
    crawler.crawl()

This will crawl all anchor links to relative addresses beginning with "/". Any exceptions encountered will be collected and presented at the end of the crawl. For more power, see the Rules section below.

If you need to authorise the client's session, e.g. login, then you should do that before creating the Crawler.

It is also a good idea to create enough data, via fixtures or otherwise, to expose enough endpoints.
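
For example, here is a minimal sketch assuming a Flask test client and Flaskr-style /auth/login and /create routes (the routes, credentials and field names are assumptions; adapt them to your application):

# Log in and create some content before constructing the Crawler,
# so the crawl starts from an authorised session and finds real pages.
client.post('/auth/login', data={'username': 'test', 'password': 'test'})
client.post('/create', data={'title': 'A post', 'body': 'Some body text'})

crawler = Crawler(
    client=client,
    initial_paths=['/'],
    rules=[Rule('a', '/.*', 'GET', Request())],
)
crawler.crawl()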

How do I set up a test client?

It depends on your framework; see its testing documentation.

Crawler Options

  • initial_paths: list of paths/URLs to start from
  • rules: list of Rules to control the crawler; see below
  • path_attrs: list of attribute names to extract paths/URLs from; defaults to "href" -- include "src" if you want to check e.g. <link>, <script> or even <img>
  • ignore_css_selectors: any elements matching this list of CSS selectors will be ignored when extracting links
  • ignore_form_fields: list of form input names to ignore when determining the identity/uniqueness of a form; include CSRF token field names here
  • max_requests: the Crawler will raise an exception if this limit is exceeded
  • capture_exceptions: upon encountering an exception, keep going and fail at the end of the crawl instead of during it (default True)
  • output_summary: print summary statistics and any captured exceptions and tracebacks at the end of the crawl (default True)
  • should_process_handlers: list of "should process" handlers; see the Handlers section
  • check_response_handlers: list of "check response" handlers; see the Handlers section
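
As an illustration, a crawler using several of these options might be constructed like the sketch below (the CSS selector and CSRF field name are assumptions; HYPERLINKS_ONLY_RULE_SET is defined under Example Rules below):

crawler = Crawler(
    client=client,
    initial_paths=['/'],
    rules=HYPERLINKS_ONLY_RULE_SET,          # see Example Rules below
    path_attrs=('href', 'src'),              # also extract src attributes
    ignore_css_selectors=['nav.sidebar a'],  # skip links in a (hypothetical) sidebar
    ignore_form_fields=['csrf_token'],       # a typical CSRF token field name
    max_requests=500,                        # abort runaway crawls
    capture_exceptions=True,                 # collect errors, fail at the end
    output_summary=True,                     # print stats and tracebacks
)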

Rules

The crawler has to be told what URLs to follow, what forms to post and what to ignore, using Rules.

Rules are made of four parameters:

Rule(<source element regex>, <target URL/path regex>, <HTTP method>, <action to take>)

These are matched against every HTML element that the crawler encounters, with the last matching rule winning.

Actions must be one of the following objects:

  1. Request(only=False, params=None) -- follow a link or submit a form
    • only=True will retrieve a page/resource but not spider its links.
    • the dict params allows you to specify overrides for a form's default values
  2. Ignore() -- do nothing / skip
  3. Allow(status_codes) -- allow an HTTP status in the supplied list, i.e. do not consider it an error.
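
Because the last matching rule wins, a broad rule can be combined with narrower exceptions. A small sketch (the /logout path is an assumption):

rules = [
    Rule('a', '/.*', 'GET', Request()),     # follow all local links...
    Rule('a', '/logout', 'GET', Ignore()),  # ...except a (hypothetical) logout link
]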

Example Rules

Follow all local/relative links

HYPERLINKS_ONLY_RULE_SET = [
    Rule('a', '/.*', 'GET', Request()),
    Rule('area', '/.*', 'GET', Request()),
]

Request but do not spider all links

REQUEST_ONLY_EXTERNAL_RULE_SET = [
    Rule('a', '.*', 'GET', Request(only=True)),
    Rule('area', '.*', 'GET', Request(only=True)),
]

This is useful for finding broken links. You can also check <link> tags from the <head> if you include the following rule and set the Crawler's path_attrs to ("HREF", "SRC").

Rule('link', '.*', 'GET', Request())
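
Put together, that might look like the following sketch, reusing the rule set above:

crawler = Crawler(
    client=client,
    initial_paths=['/'],
    path_attrs=("HREF", "SRC"),
    rules=REQUEST_ONLY_EXTERNAL_RULE_SET + [
        Rule('link', '.*', 'GET', Request()),
    ],
)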

Submit forms with GET or POST

SUBMIT_GET_FORMS_RULE_SET = [
    Rule('form', '.*', 'GET', Request())
]

SUBMIT_POST_FORMS_RULE_SET = [
    Rule('form', '.*', 'POST', Request())
]

Forms are submitted with their default values, unless overridden using Request(params={...}) for a specific form target, or excluded globally using the ignore_form_fields parameter to Crawler (necessary for e.g. CSRF token fields).
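
For example, a sketch that overrides one form's values and ignores a CSRF field globally (the /search target and both field names are assumptions):

crawler = Crawler(
    client=client,
    initial_paths=['/'],
    ignore_form_fields=['csrf_token'],  # differing tokens should not make forms look distinct
    rules=SUBMIT_GET_FORMS_RULE_SET + [
        # override the default value of the (hypothetical) /search form's "q" field
        Rule('form', '/search', 'GET', Request(params={'q': 'crawler'})),
    ],
)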

Allow some routes to fail

PERMISSIVE_RULE_SET = [
    Rule('.*', '.*', 'GET', Allow([*range(400, 600)])),
    Rule('.*', '.*', 'POST', Allow([*range(400, 600)]))
]

If any HTTP error (400-599) is encountered for any request, allow it; do not error.
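
A narrower variant is often more useful: tolerate particular statuses on particular routes while still failing everywhere else. A sketch (the /reports path and status codes are assumptions):

rules = HYPERLINKS_ONLY_RULE_SET + [
    # tolerate 403/404 from a (hypothetical) reports area, but nowhere else
    Rule('a', '/reports/.*', 'GET', Allow([403, 404])),
]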

Crawl Graph

The crawler builds up a graph of your web application. It can be interrogated via crawler.graph when the crawl is finished.

See the graph module for the definition of Node objects.

Handlers

Two hook points are provided. These operate on Node objects (see above).

Whether to process a Node

Using should_process_handlers, you can register functions that take a Node and return a bool indicating whether the Crawler should "process" it, i.e. follow the link or submit the form.
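
A sketch of such a handler; the node.path attribute used here is an assumption for illustration, so check the graph module for the actual Node attributes:

def skip_admin_pages(node):
    # Hypothetical: only process nodes whose path lies outside the admin area.
    return not node.path.startswith('/admin')

crawler = Crawler(
    client=client,
    initial_paths=['/'],
    rules=HYPERLINKS_ONLY_RULE_SET,
    should_process_handlers=[skip_admin_pages],
)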

Whether a response is acceptable

Using check_response_handlers, you can register functions that take a Node and a response object (specific to your test client) and return a bool indicating whether the response should be considered an error.

If your function returns True, the Crawler will throw an exception.
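
A sketch assuming a Flask test client, whose responses expose the body as response.data (adapt the attribute for other test clients):

def body_contains_traceback(node, response):
    # Treat any page that renders a Python traceback as an error.
    return b'Traceback (most recent call last)' in response.data

crawler = Crawler(
    client=client,
    initial_paths=['/'],
    rules=HYPERLINKS_ONLY_RULE_SET,
    check_response_handlers=[body_contains_traceback],
)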

Examples

There are currently Flask and Django examples in the tests.

See https://github.com/python-testing-crawler/flaskr for an example of integrating into an existing application, using Flaskr, the Flask tutorial application.
