A web spider for collecting specific data across a set of configured sites

Project description

Parker is a Python-based web spider for collecting specific data across a set of configured sites.

Non-Python requirements:

  • Redis - for task queuing and visit tracking

  • libxml - for HTML parsing of pages

Installation

Install using pip:

$ pip install parker

Configuration

To configure Parker, you will need to install the configuration files in a suitable location for the user running Parker. To do this, use the parker-config script. For example:

$ parker-config ~/.parker

This will install the configuration in your home directory and will output the related environment variable for you to set in your .bashrc.
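
For example, if you installed the configuration to ~/.parker as above, the exported variable would look like the following (PARKER_CONFIG is the variable introduced in 0.4.0; parker-config prints an example for your exact path):

$ export PARKER_CONFIG=~/.parker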

Changes

0.9.3

  • Fixed an issue where images pushed to S3 lost their file extension.

0.9.2

  • Fixed an issue where images pushed to S3 were using the wrong filename.

0.9.1

  • Fixed an issue where the S3 bucket configuration wasn’t being passed through.

0.9.0

  • Completely refactored storage and added the ability to store consumed data to Amazon’s S3 service. This requires your AWS Access Key ID to be set up as an environment variable (see the example after this list).

  • Added an ‘unclassified’ path prefix for sites configured without a classification.

  • Added a ‘crawl_uri_filters’ site configuration list which should contain regular expressions to match any URIs you would like filtered out of the crawl.

  • Added a ‘seconds_until_expire’ site configuration value which sets the expiry of the sets stored in Redis. We recommend setting this to roughly how long Parker takes to crawl your site, so that when a crawl finishes the sets expire shortly afterwards to save memory (see the sketch after this list).
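
To illustrate 0.9.0’s options: the AWS Access Key ID is read from the environment, so, assuming boto’s conventional variable names (the exact names are not documented here), the setup would be something like:

$ export AWS_ACCESS_KEY_ID=<your-access-key-id>
$ export AWS_SECRET_ACCESS_KEY=<your-secret-key>

A site entry combining the new keys might then look like the sketch below. Only the key names come from this changelog; the surrounding structure is hypothetical, so check the files installed by parker-config for the real format:

site = {
    "base_uri": "http://example.com",
    "classification": "news",  # omit to fall back to the 'unclassified' path prefix
    "crawl_uri_filters": [r"\.pdf$", r"/login"],  # regexes for URIs to drop from the crawl
    "seconds_until_expire": 3600,  # expiry for the Redis sets; match your typical crawl time
}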

0.8.0

  • Updated key-value functionality to allow a sub-selector to return a list of values.

  • Added the ability to add an expiry time in seconds to the RedisSet objects, with a default of 5 days.

  • Added an extra set to track URIs that are already on the crawl queue. This should cut down on duplication, but may eat memory if there are multiple possible URIs for the same page (see the sketch after this list).
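
The pattern behind both of these changes can be sketched with the redis-py client; Parker’s RedisSet class is internal, so the key name and helper below are illustrative only:

import redis

r = redis.Redis()
QUEUED = "parker:queued:example.com"  # hypothetical key for the 'already queued' set

def enqueue_once(uri, ttl=5 * 24 * 60 * 60):
    """Track uri as queued, returning True only the first time (5-day default expiry)."""
    if r.sadd(QUEUED, uri):  # SADD returns 1 only when the member is new
        r.expire(QUEUED, ttl)  # keep the set from outliving the crawl
        return True  # the caller would push uri onto the real work queue here
    return False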

0.7.3

  • Patch to fix an issue where the consumer was overlooking media URIs that start with / and are therefore relative to the base_uri configuration (see the sketch after this list).

  • Added boto to the requirements for future use.
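
Both this fix and the analogous crawler fix in 0.7.2 below come down to resolving root-relative URIs against the configured base_uri. A minimal sketch of that resolution using the standard library (not Parker’s actual code):

from urllib.parse import urljoin

base_uri = "http://example.com/section/"
print(urljoin(base_uri, "/images/logo.png"))  # http://example.com/images/logo.png
print(urljoin(base_uri, "photo.jpg"))         # http://example.com/section/photo.jpg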

0.7.2

  • Patch to fix an issue where the crawler was overlooking URIs that start with / and are therefore relative to the base_uri configuration.

0.7.1

  • Patch to fix an issue where, if a classification is not present in the site config, the output path includes “None”.

0.7.0

  • Rework the client to allow for improved proxy failover should we need it. Improve testing a little to back this up.

  • Add tagging to the configuration. Tags are simply passed through to the resulting JSON objects output by the model so that you can tag them with whatever you want.

  • Add classification to the configuration. Again this is passed through, but it is also used in the output file path from the consumer worker (see the sketch after this list).
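
As an illustration of the pass-through, an item output by the consumer with both options configured might look like the following; every field except the tags and classification pass-through is invented for the example:

{
    "title": "Some scraped headline",
    "tags": ["uk", "retail"],
    "classification": "news"
}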

0.6.0

  • Add tracking of visited URIs as well as page hashes to the crawl worker, and use both to reduce the number of URIs added to the crawl queue (see the sketch below).
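
A rough sketch of the page-hash half of that check, using hashlib and an in-memory set (Parker keeps these sets in Redis, per 0.8.0 above); all names are illustrative:

import hashlib

seen_hashes = set()

def is_new_page(body: bytes) -> bool:
    """Return True the first time a given page body is seen."""
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True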

0.5.1

  • Fix an issue with the order of key-value reference resolution that prevented the effective use of unique_field if using a field that was a kv_ref.

  • Add some Parker-specific configuration so we can specify where to download to, in case the PROJECT env variable doesn’t exist.

0.5.0

  • Update ConsumeModel to post process the data. This enables us to populate specific data from a reference to a key-value field.

  • Reorder changes so newest first, and rename to “Changes” in the long description.

0.4.2

  • Bug fix for the RST headers, which may have been the problem.

  • Remove the decode/encode, which turned out not to be the issue.

0.4.1

  • Bug fix to see if RST in ASCII fixes issues on PyPI.

0.4.0

  • Added handling for a PARKER_CONFIG environment variable, allowing users to specify where configuration files are loaded from.

  • Added the parker-config script to install default configuration files to a passed location. Also prints out an example PARKER_CONFIG environment variable to add to your profile files.

  • Updated documentation to use proper reStructuredText files.

  • Add a CHANGES file to track updates.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Parker-0.9.3.tar.gz (143.8 kB)

Uploaded: Source

Built Distribution

Parker-0.9.3-py2.py3-none-any.whl (22.6 kB)

Uploaded: Python 2, Python 3

File details

Details for the file Parker-0.9.3.tar.gz.

File metadata

  • Download URL: Parker-0.9.3.tar.gz
  • Upload date:
  • Size: 143.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for Parker-0.9.3.tar.gz:

  • SHA256: 5ed5780f30c591c375f7abc52770f0b9ed48d95feb0afd5c0248f2dc339057b5

  • MD5: 3aa001a2028dd51561b5a4d8cd92f585

  • BLAKE2b-256: 9d1c4763afb87458f009ded13e1b50301e3cbc3b4340036709c0d8df1868f8fa

File details

Details for the file Parker-0.9.3-py2.py3-none-any.whl.

File hashes

Hashes for Parker-0.9.3-py2.py3-none-any.whl:

  • SHA256: 8fc5f4cae32b4d36db7ac8bf53e70dd048685d817f6462d5cea071e1d46005d4

  • MD5: 34836880b3b84b218e835e66df432e21

  • BLAKE2b-256: 479f4d02856732b8aa29ee8d7a96ef2cf6d0add7623c42ef43fc3b10e3308d3b
