A web spider for collecting specific data across a set of configured sites
Parker is a Python-based web spider for collecting specific data across a set of configured sites.
Parker requires:
- Redis - for task queuing and visit tracking
- libxml - for HTML parsing of pages
Install using pip:
$ pip install parker
To configure Parker, install the configuration files in a location accessible to the user running Parker. Use the parker-config script to do this. For example:
$ parker-config ~/.parker
This installs the configuration files under ~/.parker in your home directory and prints the related PARKER_CONFIG environment variable for you to set in your .bashrc.
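As a rough illustration of how an environment variable such as PARKER_CONFIG can select the configuration directory, here is a minimal sketch. It is not Parker's actual implementation: the function name `resolve_config_dir` and the fallback to `~/.parker` are assumptions made for this example.

```python
import os


def resolve_config_dir(environ=None, default="~/.parker"):
    """Return the directory that configuration files would be loaded from.

    Hypothetical helper: PARKER_CONFIG is the variable named in this
    document, but the fallback default is an assumption for illustration.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("PARKER_CONFIG", default)
    # Expand "~" and normalise to an absolute path.
    return os.path.abspath(os.path.expanduser(path))


if __name__ == "__main__":
    print(resolve_config_dir())
```

Passing a dictionary as `environ` makes the lookup easy to test without touching the real environment.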
- Patch to fix an issue where the crawler was overlooking URIs that start with / and are therefore relative to the base_uri configuration.
- Patch to fix an issue where, if class is not present in the site config, the path includes “None”.
- Rework the client to allow for improved proxy failover should we need it. Improve testing a little to back this up.
- Add tagging to the configuration. These are simply passed through to the resulting JSON objects output by the model so that you can tag them with whatever you want.
- Add classification to the configuration. Again this is passed through, but is also used in the output file path from the consumer worker.
- Add tracking of visited URIs as well as page hashes to the crawl worker. Use that to reduce the number of URIs added to the crawl queue.
- Fix an issue with the order of key-value reference resolution that prevented the effective use of unique_field if using a field that was a kv_ref.
- Add some Parker specific configuration so we can specify where to download, in case the PROJECT env variable doesn’t exist.
- Update ConsumeModel to post process the data. This enables us to populate specific data from a reference to a key-value field.
- Reorder changes so newest first, and rename to “Changes” in the long description.
- Fix the RST headers, which may be the cause of the problem.
- Remove the decode/encode step, which was not the cause of the issue.
- Try ASCII-only RST to see whether that fixes the issues on PyPI.
- Added handling for a PARKER_CONFIG environment variable, allowing users to specify where configuration files are loaded from.
- Added the parker-config script to install default configuration files to a passed location. Also prints out an example PARKER_CONFIG environment variable to add to your profile files.
- Updated documentation to use proper reStructuredText files.
- Add a CHANGES file to track updates.
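Two of the changes above, resolving URIs that start with / against base_uri and tracking visited URIs and page hashes to keep the crawl queue small, can be sketched as follows. This is an in-memory illustration only, not Parker's code: the `VisitTracker` class and its method names are hypothetical, Python sets stand in for Redis, SHA-256 is an assumed hash, and `urljoin` is an assumed way to resolve relative URIs.

```python
import hashlib
from urllib.parse import urljoin


class VisitTracker:
    """Hypothetical in-memory stand-in for Redis-backed visit tracking."""

    def __init__(self):
        self.visited_uris = set()
        self.page_hashes = set()

    def should_queue(self, base_uri, href):
        """Resolve href against base_uri; return the URI if unseen, else None."""
        uri = urljoin(base_uri, href)  # handles hrefs that start with "/"
        if uri in self.visited_uris:
            return None
        self.visited_uris.add(uri)
        return uri

    def is_new_page(self, content):
        """Return True the first time a given page body is seen."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self.page_hashes:
            return False
        self.page_hashes.add(digest)
        return True


if __name__ == "__main__":
    tracker = VisitTracker()
    print(tracker.should_queue("http://example.com", "/items/1"))
    print(tracker.should_queue("http://example.com", "/items/1"))
```

Hashing page content in addition to tracking URIs catches the case where several distinct URIs serve the same page.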
|Filename (size)|File type|Python version|Upload date|
|Parker-0.7.2-py2.py3-none-any.whl (18.8 kB)|Wheel|2.7|Sep 1, 2014|
|Parker-0.7.2.tar.gz (139.7 kB)|Source|None|Sep 1, 2014|