A Scrapy-based project that feeds content into INSPIRE-HEP (http://inspirehep.net).
Project description
HEPcrawl is a harvesting library based on Scrapy (http://scrapy.org) for INSPIRE-HEP (http://inspirehep.net). It focuses on automatic and semi-automatic retrieval of new content from all the sources the site aggregates, in particular content from major and minor publishers in the field of High-Energy Physics.
The project is currently in an early stage of development.
Installation for developers
We start by creating a virtual environment for our Python packages:
mkvirtualenv hepcrawl
cdvirtualenv
mkdir src && cd src
Now we grab the code and install it in development mode:
git clone https://github.com/inspirehep/hepcrawl.git
cd hepcrawl
pip install -e .
Development mode ensures that any changes you make to the sources are picked up automatically, so there is no need to reinstall the package after every change.
Finally, run the tests to make sure everything is set up correctly:
python setup.py test
Run example crawler
Thanks to the command line tools provided by Scrapy, we can easily test the spiders as we develop them. Here is an example that runs the arXiv spider against a sample record shipped with the test suite:
cdvirtualenv src/hepcrawl
scrapy crawl arXiv -a source_file=file://`pwd`/tests/responses/arxiv/sample_arxiv_record.xml
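Under the hood, a spider's parse callback turns an XML response like the sample record into structured items. As a rough illustration only (this is not HEPcrawl's actual code, and the record layout and field names below are invented for the sketch; the real fixture lives in tests/responses/arxiv/sample_arxiv_record.xml), the core extraction step can be sketched with the standard library alone:

```python
import xml.etree.ElementTree as ET

# Hypothetical record, loosely modeled on an arXiv-style XML response.
SAMPLE_XML = """\
<record>
  <metadata>
    <title>Search for New Physics at the LHC</title>
    <id>1605.01234</id>
    <abstract>We present results...</abstract>
  </metadata>
</record>
"""

def parse_record(xml_text):
    """Extract a few fields from a record into a plain dict,
    similar in spirit to what a spider's parse callback yields."""
    meta = ET.fromstring(xml_text).find("metadata")
    return {
        "title": meta.findtext("title"),
        "arxiv_id": meta.findtext("id"),
        "abstract": meta.findtext("abstract"),
    }

record = parse_record(SAMPLE_XML)
print(record["title"])  # -> Search for New Physics at the LHC
```

In the real spiders, Scrapy selectors and the HEPRecord item loaders do this work, adding input/output processing on top of the raw extraction.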
Thanks for contributing!
Changes
Version 0.2.0 (2016-06-02)
11 new spiders, including arXiv, APS, Base OAI source, Elsevier and many more.
Updated HEPRecord data items to conform with updates to INSPIRE data model.
Reorganized loaders so that input and output processing of metadata happens in one place.
New pipelines for pushing crawled content to INSPIRE servers.
Better error handling and reporting, including support for Sentry.
Version 0.1.0 (2015-10-26)
Initial commit