A base class for building web scrapers for statistical data.
Statscraper is a base library for building web scrapers for statistical data, with a helper ontology for (primarily Swedish) statistical data. A set of ready-to-use scrapers is included.
For users
You can use Statscraper as a foundation for your next scraper, or try out any of the included scrapers. With Statscraper comes a unified interface for scraping, and some useful helper methods for scraper authors.
Full documentation: ReadTheDocs
For updates and discussion: Facebook
By Journalism++ Stockholm, and Robin Linderborg.
Installing
pip install statscraper
Using a scraper
Scrapers act like “cursors” that move around a hierarchy of datasets and collections of datasets. Collections and datasets are referred to as “items”.
      ┏━ Collection ━━━ Collection ━┳━ Dataset
ROOT ━╋━ Collection ━┳━ Dataset     ┣━ Dataset
      ┗━ Collection  ┣━ Dataset     ┗━ Dataset
                     ┗━ Dataset

╰─────────────────────────┬───────────────────────╯
                        items
Here’s a simple example, with a scraper that returns only a single dataset: the number of cranes spotted at Hornborgarsjön each day, as scraped from Länsstyrelsen i Västra Götalands län.
>>> from statscraper.scrapers import Cranes
>>> scraper = Cranes()
>>> scraper.items # List available datasets
[<Dataset: Number of cranes>]
>>> dataset = scraper["Number of cranes"]
>>> dataset.dimensions
[<Dimension: date (Day of the month)>, <Dimension: month>, <Dimension: year>]
>>> row = dataset.data[0] # first row in this dataset
>>> row
<Result: 7 (value)>
>>> row.dict
{'value': '7', u'date': u'7', u'month': u'march', u'year': u'2015'}
>>> df = dataset.data.pandas # get this dataset as a Pandas dataframe
Building a scraper
Scrapers are built by extending a base scraper, or a derivative of it. You need to provide a method for listing datasets or collections of datasets, and a method for fetching data.
Statscraper is built for statistical data, meaning that it’s most useful when the data you are scraping/fetching can be organized with a numerical value in each row:
city | year | value |
---|---|---|
Voi | 2009 | 45483 |
Kabarnet | 2006 | 10191 |
Taveta | 2009 | 67505 |
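If a source publishes a wide table (one entry per city with one value per year, say), it can be reshaped so that every row carries exactly one value. The following is a minimal sketch in plain Python, independent of Statscraper; the `wide` input and the `to_long` helper are hypothetical and only illustrate the row shape shown above.

```python
def to_long(wide, dimension="year"):
    """Turn {city: {year: value}} into one row per (city, year) pair."""
    rows = []
    for city, values_by_year in wide.items():
        for year, value in values_by_year.items():
            rows.append({"city": city, dimension: year, "value": value})
    return rows


# Hypothetical wide input, reusing the figures from the table above.
wide = {
    "Voi": {2009: 45483},
    "Kabarnet": {2006: 10191},
    "Taveta": {2009: 67505},
}

rows = to_long(wide)
for row in rows:
    print(row)
```

Each resulting row has the dimensions (`city`, `year`) as keys alongside a single `value`, which is the shape a scraper would yield row by row.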
A scraper can override these methods:
_fetch_itemslist(item) to yield collections or datasets at the current cursor position
_fetch_data(dataset) to yield rows from the currently selected dataset
_fetch_dimensions(dataset) to yield dimensions available for the currently selected dataset
_fetch_allowed_values(dimension) to yield allowed values for a dimension
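The division of labour between these methods can be sketched without the library. The `MiniScraper` base class below is a hypothetical stand-in, not Statscraper's `BaseScraper`; it only mimics two of the method names listed above to show how generator-based overrides can plug into a shared interface. The dataset name `"Towns"` and the `fetch` helper are likewise assumptions made for illustration.

```python
class MiniScraper:
    """Hypothetical stand-in for a base scraper (not the real BaseScraper)."""

    def _fetch_itemslist(self, item):
        raise NotImplementedError

    def _fetch_data(self, dataset):
        raise NotImplementedError

    @property
    def items(self):
        # None stands for the root position in this sketch.
        return list(self._fetch_itemslist(None))

    def fetch(self, dataset):
        # Collect every row the subclass yields for this dataset.
        return list(self._fetch_data(dataset))


class TownScraper(MiniScraper):
    """Example subclass that serves one hard-coded dataset."""

    def _fetch_itemslist(self, item):
        yield "Towns"

    def _fetch_data(self, dataset):
        # A real scraper would download and parse a page here.
        yield {"city": "Voi", "year": 2009, "value": 45483}
        yield {"city": "Kabarnet", "year": 2006, "value": 10191}


scraper = TownScraper()
print(scraper.items)
print(scraper.fetch("Towns"))
```

Because the overrides are generators, a scraper can start yielding rows before the whole source has been parsed, and the base class decides when to consume them.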
A number of hooks are available for more advanced scrapers. To use one, add the on decorator to a method:
@BaseScraper.on("up")
def my_method(self):
    # Do something when the user moves up one level
    pass
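One way such a decorator can work is to tag each method with a hook name and have the base class call every tagged method when the event fires. The sketch below assumes that design; `HookedBase`, `_run_hooks` and `move_up` are hypothetical names, and Statscraper's own implementation may differ.

```python
class HookedBase:
    """Hypothetical base class with a minimal hook mechanism."""

    @staticmethod
    def on(hook_name):
        def decorator(method):
            # Remember which event this method wants to react to.
            method._hook_name = hook_name
            return method
        return decorator

    def _run_hooks(self, hook_name):
        # Call every method on this instance tagged with hook_name.
        for attr in dir(self):
            candidate = getattr(self, attr)
            tagged = getattr(candidate, "_hook_name", None) == hook_name
            if callable(candidate) and tagged:
                candidate()

    def move_up(self):
        # Moving the cursor would happen here; then the hooks fire.
        self._run_hooks("up")


class MyScraper(HookedBase):
    def __init__(self):
        self.log = []

    @HookedBase.on("up")
    def remember_move(self):
        self.log.append("moved up")


scraper = MyScraper()
scraper.move_up()
print(scraper.log)
```

Tagging the function rather than keeping a global registry means each subclass carries its own hooks, so two scrapers in the same process do not interfere with each other.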
For developers
These instructions are for developers working on the BaseScraper. See above for instructions for developing a scraper using the BaseScraper.
Downloading
git clone https://github.com/jplusplus/statscraper
cd statscraper
python setup.py install
This repo includes statscraper-datatypes as a subtree. To update this, do:
git subtree pull --prefix statscraper/datatypes git@github.com:jplusplus/statscraper-datatypes.git master --squash
Tests
Since 2.0.0 we are using pytest. To run an individual test:
python3 -m pytest tests/test-datatypes.py
Changelog
The changelog has been moved to CHANGELOG.md.