Spydey

A simple web spider with several recursion strategies. Home page is at http://github.com/slinkp/spydey.

It doesn’t do much except follow links and report status. I mostly use it for quick and dirty smoke testing and link checking.

The only unusual feature is the --traversal=pattern option, which recurses in an unusual order: it tries to recognize patterns in URLs, and follows URLs of novel patterns before those with patterns it has seen before. When there are no novel patterns left to follow, it follows random links to URLs of known patterns. If you use this for smoke-testing a typical modern web app that maps URL patterns to views/controllers, it will very quickly hit all your views/controllers at least once… usually. But it’s not very interesting when pointed at a website that has arbitrarily deep trees (static files, VCS repositories, and the like).
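
To make the ordering concrete, here is a rough sketch of a pattern-first queue. It is not spydey's actual code: the names `url_pattern` and `PatternFrontier` are made up for illustration, and the pattern rule (collapse every run of digits in the path to `N`) is only an assumption about what "recognizing a pattern" could mean.

```python
import random
import re
from collections import deque
from urllib.parse import urlsplit


def url_pattern(url):
    """Reduce a URL to a rough pattern.

    Assumption: runs of digits in the path are the variable part,
    so /posts/1 and /posts/2 share the pattern /posts/N.
    """
    path = urlsplit(url).path
    return re.sub(r"\d+", "N", path)


class PatternFrontier:
    """Hypothetical queue that hands out URLs with unseen patterns first."""

    def __init__(self, rng=None):
        self.novel = deque()        # URLs whose pattern has not been followed yet
        self.known = []             # URLs whose pattern has already been followed
        self.seen_patterns = set()
        self.rng = rng or random.Random()

    def add(self, url):
        if url_pattern(url) in self.seen_patterns:
            self.known.append(url)
        else:
            self.novel.append(url)

    def pop(self):
        # Prefer URLs with novel patterns.
        while self.novel:
            url = self.novel.popleft()
            pattern = url_pattern(url)
            if pattern not in self.seen_patterns:
                self.seen_patterns.add(pattern)
                return url
            # The pattern was followed after this URL was queued.
            self.known.append(url)
        # No novel patterns left: follow a random URL of a known pattern.
        if self.known:
            index = self.rng.randrange(len(self.known))
            return self.known.pop(index)
        raise IndexError("frontier is empty")
```

With /posts/1, /posts/2, /about and /users/7 queued in that order, this sketch hands out /posts/1, /about and /users/7 before it returns to /posts/2. It does not deduplicate URLs; a real crawler would also track which URLs it has already queued.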

Also, it’s designed so that adding a new recursion strategy is trivial. Spydey was originally written for the purpose of experimenting with different recursive crawling strategies. Read the source.
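
As a loose illustration of why a new strategy can be cheap to add, the sketch below keeps the crawl loop fixed and lets the strategy decide only which queued URL comes out next. The class names, the `crawl` function and the `get_links` callback are all hypothetical and do not mirror spydey's real structure; `get_links` stands in for fetching a page and extracting its links.

```python
import random
from collections import deque


class BreadthFirst:
    """Oldest queued URL first."""

    def __init__(self):
        self.queue = deque()

    def add(self, url):
        self.queue.append(url)

    def pop(self):
        return self.queue.popleft()

    def __len__(self):
        return len(self.queue)


class DepthFirst(BreadthFirst):
    """Most recently queued URL first."""

    def pop(self):
        return self.queue.pop()


class RandomOrder:
    """Any queued URL, chosen at random."""

    def __init__(self, rng=None):
        self.urls = []
        self.rng = rng or random.Random()

    def add(self, url):
        self.urls.append(url)

    def pop(self):
        return self.urls.pop(self.rng.randrange(len(self.urls)))

    def __len__(self):
        return len(self.urls)


def crawl(start_url, get_links, strategy, max_requests=200):
    """Visit URLs in the order chosen by `strategy`.

    `get_links(url)` is a stand-in for fetching `url` and returning
    the links found on it.
    """
    seen = {start_url}
    strategy.add(start_url)
    visited = []
    while len(strategy) > 0 and len(visited) < max_requests:
        url = strategy.pop()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                strategy.add(link)
    return visited
```

Under this layout a new strategy only needs `add`, `pop` and `__len__`; the loop that fetches pages and tracks seen URLs stays untouched.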

Oh, and if you install Fabulous, console output is in color.

For lazy, zero-configuration smoke testing, I typically run it like:

spydey -r --stop-on-error --max-requests=200 --traversal=pattern --profile --log-referrer URL

There are a number of other command-line options, many stolen from wget. Use --help to see what they are.

Usage

Usage: spydey [options] URL

Options:
  -h, --help            show this help message and exit
  -r, --recursive       Recur into subdirectories
  -p, --page-requisites
                        Get all images, etc. needed to display HTML page.
  --no-parent           Don't ascend to the parent directory.
  -R REJECT, --reject=REJECT
                        Regex for filenames to reject. May be given multiple
                        times.
  -A ACCEPT, --accept=ACCEPT
                        Regex for filenames to accept. May be given multiple
                        times.
  -t TRAVERSAL, --traversal=TRAVERSAL, --traverse=TRAVERSAL
                        Recursive traversal strategy. Choices are: breadth-
                        first, depth-first, hybrid, pattern, random
  -H, --span-hosts      Go to foreign hosts when recursive.
  -w WAIT, --wait=WAIT  Wait SECONDS between retrievals.
  --random-wait=RANDOM_WAIT
                        Wait from 0...2*WAIT secs between retrievals.
  --loglevel=LOGLEVEL   Log level.
  --log-referrer, --log-referer
                        Log referrer URL for each request.
  --transient-log       Use Fabulous transient logging config.
  --max-redirect=MAX_REDIRECT
                        Maximum number of redirections to follow for a
                        resource.
  --max-requests=MAX_REQUESTS
                        Maximum number of requests to make before exiting. (-1
                        used with --traversal=pattern means exit when out of
                        new patterns)
  --stop-on-error       Stop after the first HTTP error (response code 400 or
                        greater).
  -T TIMEOUT, --timeout=TIMEOUT
                        Set the network timeout in seconds. 0 means no
                        timeout.
  -P, --profile         Print the time to download each resource, and a
                        summary of the 20 slowest at the end.
  --stats               Print a summary of traversal patterns, if
                        --traversal=pattern
  -v, --version         Print version information and exit.

Changelog

0.5

  • Remove useless pattern stats unless --stats is given
  • Fix to prevent spanning hosts when following redirects, unless -H is on.

0.4

  • Add --stop-on-error option
  • Add --max-requests=-1 to mean stop after all patterns are seen (when used with --traversal=pattern)
  • Add usage text automatically to pkg info

0.3

  • Better redirect handling: obeys -A, -R, --max-redirect, and --max-requests options
  • Minor bugfixes and refactoring

Release History

  • 0.5 (this version)
  • 0.4r1
  • 0.4
  • 0.3
  • 0.2
  • 0.1

Download Files

  • spydey-0.5.tar.gz (10.5 kB), source distribution, uploaded Feb 9, 2012
