A base class for building web scrapers for statistical data.
Project description
Statscraper is a base library for building web scrapers for statistical data, with a helper ontology for (primarily Swedish) statistical data. A set of ready-to-use scrapers is included.
For users
You can use Statscraper as a foundation for your next scraper, or try out any of the included scrapers. With Statscraper comes a unified interface for scraping, and some useful helper methods for scraper authors.
Full documentation: ReadTheDocs
For updates and discussion: Facebook
By Journalism++ Stockholm, and Robin Linderborg.
Installing
pip install statscraper
Using a scraper
Scrapers act like “cursors” that move around a hierarchy of datasets and collections of datasets. Collections and datasets are referred to as “items”.
       ┏━ Collection ━━━ Collection ━┳━ Dataset
ROOT ━╋━ Collection ━┳━ Dataset      ┣━ Dataset
       ┃             ┗━ Collection ━┳━ Dataset
       ┗━ Dataset                   ┗━ Dataset
       ╰─────────────────┬────────────────╯
                       items
Here’s a simple example, with a scraper that returns only a single dataset:
>>> from statscraper.scrapers import Cranes
>>> scraper = Cranes()
>>> scraper.items # List available datasets
[<Dataset: Number of cranes>]
>>> dataset = scraper.items[0]
>>> dataset.dimensions
[<Dimension: date (date)>, <Dimension: month (month)>, <Dimension: year (year)>]
>>> dataset.data[0] # Print first row of data
7
>>> dict(dataset.data[0])
{'date': '1', 'year': '2010', 'value': '7', 'month': 'januari'}
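Because each row converts to a plain dict, downstream processing needs only the standard library. Here is a hedged sketch that totals the value column per year; the sample rows are hard-coded in the format shown above, not live scraper output:

```python
from collections import defaultdict

# Sample rows mimicking the dict format a scraped row converts to.
rows = [
    {"date": "1", "year": "2010", "value": "7", "month": "januari"},
    {"date": "1", "year": "2010", "value": "9", "month": "februari"},
    {"date": "1", "year": "2011", "value": "4", "month": "januari"},
]

totals = defaultdict(int)
for row in rows:
    totals[row["year"]] += int(row["value"])  # values arrive as strings

print(dict(totals))  # {'2010': 16, '2011': 4}
```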
Building a scraper
Scrapers are built by extending a base scraper, or a derivative of it. You need to provide a method for listing datasets or collections of datasets, and one for fetching data.
Statscraper is built for statistical data, meaning that it’s most useful when the data you are scraping/fetching can be organized with a numerical value in each row:
city | year | value
---|---|---
Voi | 2009 | 45483
Kabarnet | 2006 | 10191
Taveta | 2009 | 67505
A scraper can override these methods:
_fetch_itemslist(item) to yield collections or datasets at the current cursor position
_fetch_data(dataset) to yield rows from the currently selected dataset
_fetch_dimensions(dataset) to yield dimensions available for the currently selected dataset
_fetch_allowed_values(dimension) to yield allowed values for a dimension
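Putting these methods together, here is a hedged sketch of the shape a minimal scraper takes. A real scraper would subclass statscraper.BaseScraper; to keep this example self-contained and runnable without the library, a plain class stands in, and the dataset name and rows are hypothetical:

```python
class CranesScraper:
    """Stand-in for a BaseScraper subclass serving one hard-coded dataset."""

    DATA = [
        {"year": "2010", "month": "januari", "value": "7"},
        {"year": "2010", "month": "februari", "value": "6"},
    ]

    def _fetch_itemslist(self, item):
        # Yield collections or datasets at the current cursor position.
        yield "Number of cranes"

    def _fetch_dimensions(self, dataset):
        # Yield the dimensions available for the selected dataset.
        yield "year"
        yield "month"

    def _fetch_data(self, dataset):
        # Yield rows from the selected dataset.
        for row in self.DATA:
            yield row

scraper = CranesScraper()
datasets = list(scraper._fetch_itemslist(None))
rows = list(scraper._fetch_data(datasets[0]))
print(datasets)          # ['Number of cranes']
print(rows[0]["value"])  # 7
```

In the real library the base class wraps these generators in the cursor interface (items, data, dimensions) shown in the usage example above.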
A number of hooks are available for more advanced scrapers. These are attached by adding the on decorator to a method:
@BaseScraper.on("up")
def my_method(self):
    # Do something when the user moves up one level
    pass
Available hooks are:
init: Called when initiating the BaseScraper
up: Called when trying to go up one level
select: Called when trying to move to a Collection or Dataset
top: Called when reaching the top level
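The hook mechanism can be illustrated with a self-contained sketch. This is not the library's actual implementation, only the registration pattern it exposes: on stores methods keyed by event name, and the scraper invokes them when that event fires.

```python
class BaseScraper:
    """Stand-in base class demonstrating event-hook registration."""

    _hooks = {}

    @classmethod
    def on(cls, event):
        # Decorator factory: register `method` to run on `event`.
        def decorator(method):
            cls._hooks.setdefault(event, []).append(method)
            return method
        return decorator

    def _trigger(self, event):
        # Call every method registered for this event.
        for method in self._hooks.get(event, []):
            method(self)

class MyScraper(BaseScraper):
    moved_up = False

    @BaseScraper.on("up")
    def log_up(self):
        self.moved_up = True

scraper = MyScraper()
scraper._trigger("up")   # simulate the user moving up one level
print(scraper.moved_up)  # True
```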
For developers
These instructions are for developers working on the BaseScraper. See above for instructions for developing a scraper using the BaseScraper.
Downloading
git clone https://github.com/jplusplus/skrejperpark
python setup.py install
Tests
python setup.py test
Run python setup.py test from the root directory. This will install everything needed for testing, and then run the tests with nosetests.
Changelog
1.0.0.dev2
Implement translation
Add Dataset.fetch_next() as a generator for results
1.0.0.dev1
Semantic versioning starts here
Implement datatypes and dialects
0.0.2
Added some demo scrapers
The cursor is now moved when accessing datasets
Renamed methods for moving cursor: move_up(), move_to()
Added tests
Added datatypes subtree
0.0.1
First version
Hashes for statscraper-1.0.0.dev2-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6e91fe3aeb7394c03a8182c0d3b144af65a9a9fb206bd9dd72feca79cd3634e2
MD5 | 0be90754ff765e1375404002b20bb02b
BLAKE2b-256 | 95c13d1e93564de1a24ff54a46f7e7b41b635d27991193468bcd50ebc91dc513