
A base class for building web scrapers for statistical data.


Statscraper is a base library for building web scrapers for statistical data, with a helper ontology for (primarily Swedish) statistical data. A set of ready-to-use scrapers is included.

For users

You can use Statscraper as a foundation for your next scraper, or try out any of the included scrapers. With Statscraper comes a unified interface for scraping, and some useful helper methods for scraper authors.

Full documentation: ReadTheDocs

For updates and discussion: Facebook

By Journalism++ Stockholm and Robin Linderborg.

Installing

pip install statscraper

Using a scraper

Scrapers act like “cursors” that move around a hierarchy of datasets and collections of datasets. Collections and datasets are referred to as “items”.

      ┏━ Collection ━━━ Collection ━┳━ Dataset
ROOT ━╋━ Collection ━┳━ Dataset     ┣━ Dataset
      ┗━ Collection  ┣━ Dataset     ┗━ Dataset
                     ┗━ Dataset

╰─────────────────────────┬───────────────────────╯
                     items
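
In code, walking such a hierarchy might look roughly like this. The SomeScraper name and the item keys are made up for illustration, and the move_to/move_up method names are assumptions based on the cursor metaphor rather than taken from this page; the .items property and the scraper[...] indexing match the example below.

>>> scraper = SomeScraper()             # hypothetical multi-level scraper
>>> scraper.items                       # collections at the root
>>> scraper.move_to("Some collection")  # descend one level (assumed method name)
>>> dataset = scraper["Some dataset"]   # pick a dataset at the current position
>>> scraper.move_up()                   # move the cursor back up (assumed method name)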

Here’s a simple example, with a scraper that returns only a single dataset: the number of cranes spotted at Hornborgarsjön each day, as scraped from Länsstyrelsen i Västra Götalands län.

>>> from statscraper.scrapers import Cranes

>>> scraper = Cranes()
>>> scraper.items  # List available datasets
[<Dataset: Number of cranes>]

>>> dataset = scraper["Number of cranes"]
>>> dataset.dimensions
[<Dimension: date (Day of the month)>, <Dimension: month>, <Dimension: year>]

>>> row = dataset.data[0]  # first row in this dataset
>>> row
<Result: 7 (value)>
>>> row.dict
{'value': '7', 'date': '7', 'month': 'march', 'year': '2015'}

>>> df = dataset.data.pandas  # get this dataset as a Pandas dataframe
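
Values and dimensions arrive as strings in the example above. Assuming the dataframe columns mirror the keys in row.dict, you may want to cast the value column before doing any arithmetic, for example:

>>> df["value"] = df["value"].astype(int)  # raw values are strings
>>> df.groupby("month")["value"].sum()     # e.g. cranes spotted per month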

Building a scraper

Scrapers are built by extending a base scraper, or a derivative of one. You need to provide methods for listing datasets or collections of datasets, and for fetching data.

Statscraper is built for statistical data, meaning that it’s most useful when the data you are scraping/fetching can be organized with a numerical value in each row:

city        year    value
Voi         2009    45483
Kabarnet    2006    10191
Taveta      2009    67505

A scraper can override these methods:

  • _fetch_itemslist(item) to yield collections or datasets at the current cursor position

  • _fetch_data(dataset) to yield rows from the currently selected dataset

  • _fetch_dimensions(dataset) to yield dimensions available for the currently selected dataset

  • _fetch_allowed_values(dimension) to yield allowed values for a dimension
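
Putting these pieces together, a minimal scraper serving a single hard-coded dataset could look roughly like the sketch below. The import path and the exact constructor signatures of Dataset, Dimension and Result are assumptions here; see the full documentation for the authoritative API.

from statscraper import BaseScraper, Dataset, Dimension, Result

class HardcodedCranes(BaseScraper):
    """Hypothetical scraper that serves one hard-coded dataset."""

    def _fetch_itemslist(self, item):
        # Only one dataset, sitting directly under the root
        yield Dataset("Number of cranes")

    def _fetch_dimensions(self, dataset):
        # The dimensions each row will carry
        yield Dimension("date")
        yield Dimension("month")
        yield Dimension("year")

    def _fetch_data(self, dataset):
        # A real scraper would fetch and parse a page here
        yield Result(7, {"date": "7", "month": "march", "year": "2015"})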

A number of hooks are available for more advanced scrapers. These are enabled by adding the on decorator to a method:

@BaseScraper.on("up")
def my_method(self):
    # Do something when the user moves up one level
    pass

For developers

These instructions are for developers working on the BaseScraper itself. See above for instructions on developing a scraper using the BaseScraper.

Downloading

git clone https://github.com/jplusplus/statscraper
python setup.py install

This repo includes statscraper-datatypes as a subtree. To update this, do:

git subtree pull --prefix statscraper/datatypes git@github.com:jplusplus/statscraper-datatypes.git master --squash

Tests

Since 2.0.0 we are using pytest. To run an individual test file:

python3 -m pytest tests/test-datatypes.py
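
To run the full test suite:

python3 -m pytest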

Changelog

The changelog has been moved to CHANGELOG.md.
