statscraper

A base class for building web scrapers for statistical data.

Statscraper is a base library for building web scrapers for statistical data, with a helper ontology for (primarily Swedish) statistical data. A set of ready-to-use scrapers is included.

For users

You can use Statscraper as a foundation for your next scraper, or try out any of the included scrapers. Statscraper provides a unified interface for scraping, along with some useful helper methods for scraper authors.

Full documentation: ReadTheDocs

For updates and discussion: Facebook

By Journalism++ Stockholm, and Robin Linderborg.

Installing

pip install statscraper

Using a scraper

Scrapers act like “cursors” that move around a hierarchy of datasets and collections of datasets. Collections and datasets are referred to as “items”.

      ┏━ Collection ━━━ Collection ━┳━ Dataset
ROOT ━╋━ Collection ━┳━ Dataset     ┣━ Dataset
      ┗━ Collection  ┣━ Dataset     ┗━ Dataset
                     ┗━ Dataset

╰─────────────────────────┬───────────────────────╯
                     items
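To make the cursor idea concrete, here is a minimal, self-contained sketch of such a hierarchy. It does not use Statscraper at all: the Item and Cursor classes, their method names, and the example item names are all hypothetical and only illustrate how a cursor might move between collections and datasets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """A node in the hierarchy: either a collection or a dataset (hypothetical)."""
    name: str
    kind: str  # "collection" or "dataset"
    children: List["Item"] = field(default_factory=list)
    parent: Optional["Item"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Item") -> "Item":
        """Attach a child item and return it, so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child


class Cursor:
    """Hypothetical cursor that walks the tree one level at a time."""

    def __init__(self, root: Item):
        self.current = root

    @property
    def items(self) -> List[str]:
        """Names of the items directly below the current position."""
        return [child.name for child in self.current.children]

    def move_to(self, name: str) -> "Cursor":
        """Move down to the named child item."""
        for child in self.current.children:
            if child.name == name:
                self.current = child
                return self
        raise KeyError(name)

    def move_up(self) -> "Cursor":
        """Move up one level; stay put when already at the root."""
        if self.current.parent is not None:
            self.current = self.current.parent
        return self


# Build a small tree with made-up names.
root = Item("ROOT", "collection")
regions = root.add(Item("Regions", "collection"))
regions.add(Item("Population", "dataset"))
regions.add(Item("Births", "dataset"))
root.add(Item("Empty", "collection"))

cursor = Cursor(root)
top_level = cursor.items
second_level = cursor.move_to("Regions").items
back_at_root = cursor.move_up().current.name
```
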

Here’s a simple example, with a scraper that returns only a single dataset: the number of cranes spotted at Hornborgarsjön each day, as scraped from Länsstyrelsen i Västra Götalands län.

>>> from statscraper.scrapers import Cranes

>>> scraper = Cranes()
>>> scraper.items  # List available datasets
[<Dataset: Number of cranes>]

>>> dataset = scraper["Number of cranes"]
>>> dataset.dimensions
[<Dimension: date (Day of the month)>, <Dimension: month>, <Dimension: year>]

>>> row = dataset.data[0]  # first row in this dataset
>>> row
<Result: 7 (value)>
>>> row.dict
{'value': '7', 'date': '7', 'month': 'march', 'year': '2015'}

>>> df = dataset.data.pandas  # get this dataset as a Pandas dataframe
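Note that every entry in the row dictionary above is a string. If you want a real date and an integer count, you need to convert them yourself. The sketch below assumes the dictionary has exactly the keys shown in the example and that month names are lower-case English; the row_to_record helper is hypothetical and not part of Statscraper.

```python
from datetime import date

# English month names, in calendar order, as they appear in the example row.
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def row_to_record(row_dict):
    """Turn a row dictionary of strings into a (date, int) pair."""
    month_number = MONTHS.index(row_dict["month"].lower()) + 1
    day = date(int(row_dict["year"]), month_number, int(row_dict["date"]))
    return day, int(row_dict["value"])


# The dictionary from the example, copied by hand rather than scraped.
example = {"value": "7", "date": "7", "month": "march", "year": "2015"}
record = row_to_record(example)
```
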

Building a scraper

Scrapers are built by extending a base scraper, or a derivative of that. You need to provide a method for listing datasets or collections of datasets, and for fetching data.

Statscraper is built for statistical data, meaning that it’s most useful when the data you are scraping/fetching can be organized with a numerical value in each row:

city	year	value
Voi	2009	45483
Kabarnet	2006	10191
Taveta	2009	67505
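One way to think about this shape is that each row is a single numerical value plus a set of dimension values describing it. The following sketch splits the table above into such pairs using only the standard library. The split_rows helper and the choice of "value" as the value column name are assumptions made for this illustration, not something Statscraper prescribes.

```python
import csv
import io

# The table from above as tab-separated text.
TABLE = (
    "city\tyear\tvalue\n"
    "Voi\t2009\t45483\n"
    "Kabarnet\t2006\t10191\n"
    "Taveta\t2009\t67505\n"
)


def split_rows(text, value_column="value"):
    """Yield (value, dimensions) pairs, one per table row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        value = int(row.pop(value_column))  # the numerical value
        yield value, dict(row)              # everything else is a dimension


rows = list(split_rows(TABLE))
```
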

A scraper can override these methods:

_fetch_itemslist(item) to yield collections or datasets at the current cursor position
_fetch_data(dataset) to yield rows from the currently selected dataset
_fetch_dimensions(dataset) to yield dimensions available for the currently selected dataset
_fetch_allowed_values(dimension) to yield allowed values for a dimension
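The pattern behind these methods is that the base class owns the public interface and calls your generator methods when it needs something. The sketch below imitates that pattern with a stand-in base class, so that it runs without Statscraper installed. SketchBase, its public methods, and the plain strings and dictionaries it yields are all hypothetical; the real BaseScraper has its own item, dimension and result types, so consult the documentation before copying this.

```python
class SketchBase:
    """Stand-in for a base scraper. This is NOT the Statscraper class."""

    def datasets(self):
        # Ask the subclass which items exist at the top level.
        return list(self._fetch_itemslist(None))

    def dimensions(self, dataset):
        return list(self._fetch_dimensions(dataset))

    def data(self, dataset):
        return list(self._fetch_data(dataset))

    # Methods a scraper author is expected to override.
    def _fetch_itemslist(self, item):
        raise NotImplementedError

    def _fetch_dimensions(self, dataset):
        return iter(())  # optional: no dimensions declared by default

    def _fetch_data(self, dataset):
        raise NotImplementedError


class CityScraper(SketchBase):
    """Example scraper serving the small table above from memory."""

    ROWS = [("Voi", "2009", 45483),
            ("Kabarnet", "2006", 10191),
            ("Taveta", "2009", 67505)]

    def _fetch_itemslist(self, item):
        yield "Population"

    def _fetch_dimensions(self, dataset):
        yield "city"
        yield "year"

    def _fetch_data(self, dataset):
        for city, year, value in self.ROWS:
            yield {"city": city, "year": year, "value": value}


scraper = CityScraper()
names = scraper.datasets()
dims = scraper.dimensions(names[0])
rows = scraper.data(names[0])
```

Writing the overrides as generators means a scraper can yield rows as it parses each page, rather than building the full list first.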

A number of hooks are available for more advanced scrapers. To register a hook, add the BaseScraper.on decorator to a method of your scraper:

# Inside your scraper class:
@BaseScraper.on("up")
def my_method(self):
    # Do something when the user moves up one level
    pass
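If you are curious how a decorator-based hook like this can work, here is a small self-contained sketch. It is not Statscraper's implementation: the HookBase class, the _hook_event attribute and the _fire method are hypothetical names chosen for this illustration, and the real internals may differ.

```python
class HookBase:
    """Hypothetical base class with a decorator-based hook registry."""

    @staticmethod
    def on(event):
        """Mark the decorated method as a handler for the given event."""
        def decorator(func):
            func._hook_event = event
            return func
        return decorator

    def _fire(self, event):
        # Look through the class and its parents for methods tagged with
        # this event, and call each one on this instance.
        for klass in type(self).__mro__:
            for member in vars(klass).values():
                if getattr(member, "_hook_event", None) == event:
                    member(self)

    def move_up(self):
        # A real scraper would also move its cursor here.
        self._fire("up")


class LoggingScraper(HookBase):
    def __init__(self):
        self.log = []

    @HookBase.on("up")
    def note_move(self):
        self.log.append("moved up")


scraper = LoggingScraper()
scraper.move_up()
scraper.move_up()
log = scraper.log
```
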

For developers

These instructions are for developers working on the BaseScraper. See above for instructions for developing a scraper using the BaseScraper.

Downloading

git clone https://github.com/jplusplus/statscraper
cd statscraper
python setup.py install

This repo includes statscraper-datatypes as a subtree. To update this, do:

git subtree pull --prefix statscraper/datatypes git@github.com:jplusplus/statscraper-datatypes.git master --squash

Tests

Since 2.0.0 we are using pytest. To run an individual test:

python3 -m pytest tests/test-datatypes.py

Changelog

The changelog has been moved to CHANGELOG.md.

Release history

Version	Released	Notes
3.0.0rc0	Feb 28, 2025	pre-release
2.0.2	Mar 23, 2023	this version
2.0.1	Nov 22, 2022
2.0.1.dev7	Nov 16, 2022	pre-release
2.0.0.dev7	Oct 5, 2020	pre-release
2.0.0.dev6	Mar 3, 2020	pre-release
2.0.0.dev5	Feb 28, 2020	pre-release
2.0.0.dev4	Feb 28, 2020	pre-release
2.0.0.dev3	Feb 27, 2020	pre-release
2.0.0.dev2	Feb 23, 2020	pre-release
1.0.7	Jun 30, 2018
1.0.6	Feb 6, 2018
1.0.5	Sep 22, 2017
1.0.3	Sep 21, 2017
1.0.0	Aug 28, 2017
1.0.0.dev2	Aug 9, 2017	pre-release
1.0.0.dev1	Jun 30, 2017	pre-release

Download files

Source Distribution

statscraper-2.0.2.tar.gz (61.8 kB view details)

Uploaded Mar 23, 2023 Source

Built Distribution

statscraper-2.0.2-py3-none-any.whl (54.5 kB view details)

Uploaded Mar 23, 2023 Python 3

File details

Details for the file statscraper-2.0.2.tar.gz.

File metadata

Download URL: statscraper-2.0.2.tar.gz
Upload date: Mar 23, 2023
Size: 61.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.7

File hashes

Hashes for statscraper-2.0.2.tar.gz
Algorithm	Hash digest
SHA256	`310c507eac337bc5266ff6122239077d923e6c14ebfdc69aac893fe2ac614f0e`
MD5	`059308b3195be52c444b94a0bf016890`
BLAKE2b-256	`b21606bd930d0f6436c2d7bbcc56117feb95a064bfa1d3705e98f4ce92f7ce25`

File details

Details for the file statscraper-2.0.2-py3-none-any.whl.

File metadata

Download URL: statscraper-2.0.2-py3-none-any.whl
Upload date: Mar 23, 2023
Size: 54.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.7

File hashes

Hashes for statscraper-2.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b80a5674800c31ca5f5b92a41c2e9a10d1c19df6a3fd8345f2a8236b0370a887`
MD5	`033ee289e6639ef3b51021c41f2600ac`
BLAKE2b-256	`29a9701b51797932dc54a722a6dba8f7043db9a72eff7e8978a7c478367aa90a`
