A web spider for collecting specific data across a set of configured sites

Project description

Parker is a Python-based web spider for collecting specific data across a set of configured sites.

Non-Python requirements:

  • Redis - for task queuing and visit tracking
  • libxml - for HTML parsing of pages

Installation

Install using pip:

$ pip install parker

Configuration

To configure Parker, you will need to install the configuration files in a suitable location for the user running Parker. To do this, use the parker-config script. For example:

$ parker-config ~/.parker

This installs the configuration files under your home directory and prints the corresponding PARKER_CONFIG environment variable for you to set in your .bashrc.
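
As an illustration of how a program could honour such an environment variable, here is a minimal sketch. It is not Parker's actual implementation: the helper name resolve_config_dir and the ~/.parker fallback are assumptions made for this example, and only the PARKER_CONFIG variable name comes from the changes listed below.

```python
import os


def resolve_config_dir(environ=None, default="~/.parker"):
    """Return the directory to load configuration files from.

    Hypothetical helper for illustration only; the default location
    is an assumption, not documented Parker behaviour.
    """
    environ = os.environ if environ is None else environ
    # Prefer PARKER_CONFIG when it is set and non-empty,
    # otherwise fall back to the assumed default location.
    path = environ.get("PARKER_CONFIG") or default
    # Expand "~" and normalise to an absolute path.
    return os.path.abspath(os.path.expanduser(path))


if __name__ == "__main__":
    print(resolve_config_dir())
```

Passing the environment as an argument keeps the lookup easy to test without modifying the real process environment.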

Changes

0.6.0

  • Add tracking of visited URIs as well as page hashes to the crawl worker. Use that to reduce the number of URIs added to the crawl queue.
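
The idea behind this change can be sketched roughly as follows. This is an illustrative in-memory version, not Parker's code: the VisitTracker class and links_to_queue function are hypothetical names, plain Python sets stand in for the Redis-backed visit tracking, and SHA-256 is an assumed choice of page hash.

```python
import hashlib


class VisitTracker:
    """In-memory stand-in for visit tracking (hypothetical)."""

    def __init__(self):
        self.visited_uris = set()
        self.page_hashes = set()

    def should_queue(self, uri):
        """Return True the first time a URI is seen, and record it."""
        if uri in self.visited_uris:
            return False
        self.visited_uris.add(uri)
        return True

    def is_new_page(self, content):
        """Return True the first time this page content is seen."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.page_hashes:
            return False
        self.page_hashes.add(digest)
        return True


def links_to_queue(tracker, page_content, links):
    """Return the links from a fetched page that are worth queueing."""
    # A page whose content was already processed contributes nothing new.
    if not tracker.is_new_page(page_content):
        return []
    # Only queue URIs that have not been seen before.
    return [uri for uri in links if tracker.should_queue(uri)]


if __name__ == "__main__":
    tracker = VisitTracker()
    print(links_to_queue(tracker, b"<html>a</html>", ["/a", "/b", "/a"]))
```

Checking the page hash first means a page reachable under several URIs is only expanded once, and checking the URI set keeps repeated links out of the queue.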

0.5.1

  • Fix an issue with the order of key-value reference resolution that prevented the effective use of unique_field if using a field that was a kv_ref.
  • Add some Parker-specific configuration so we can specify where to download, in case the PROJECT environment variable doesn’t exist.

0.5.0

  • Update ConsumeModel to post process the data. This enables us to populate specific data from a reference to a key-value field.
  • Reorder changes so newest first, and rename to “Changes” in the long description.

0.4.2

  • Fix the RST headers, which may be the cause of the problem.
  • Remove the decode/encode step, which was not the cause.

0.4.1

  • Bug fix to see if RST in ASCII fixes issues on PyPI.

0.4.0

  • Added handling for a PARKER_CONFIG environment variable, allowing users to specify where configuration files are loaded from.
  • Added the parker-config script to install default configuration files to a passed location. Also prints out an example PARKER_CONFIG environment variable to add to your profile files.
  • Updated documentation to use proper reStructuredText files.
  • Add a CHANGES file to track updates.

Release history

0.9.6, 0.9.5, 0.9.4, 0.9.3, 0.9.2, 0.9.1, 0.9.0, 0.8.0, 0.7.3, 0.7.2, 0.7.1, 0.7.0, 0.6.0 (this version), 0.5.1, 0.5.0, 0.4.2, 0.4.1, 0.4.0, 0.3.1, 0.3.0, 0.2.3, 0.2.2, 0.2.1, 0.2.0

Download files

Filename                           Size      File type  Python version  Upload date
Parker-0.6.0-py2.py3-none-any.whl  18.0 kB   Wheel      2.7             Jul 22, 2014
Parker-0.6.0.tar.gz                138.9 kB  Source     None            Jul 22, 2014
