Web scraping API for Finnish websites

finscraper

This library provides a Python API for downloading data from popular Finnish websites in a structured format.

TODO: Cover picture

Websites covered include:

  • News articles from IltaSanomat (is.fi)

Installation

pip install -r requirements.txt

TODO: Make a pip package?

Documentation

TODO: Sphinx documentation or similar

Quickstart

The library provides an easy-to-use API for fetching data from various Finnish websites. For example, fetching news articles from IltaSanomat and saving them into items.json can be done as follows:

from finscraper.spiders import ISArticle

spider = ISArticle(category='nhl',
                   FEEDS={'items.json': {'format': 'json'}})
spider.run()
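
Once the spider has finished, the output can be inspected with the standard library alone. The sketch below is not part of finscraper. It assumes that items.json holds a single JSON array of objects, which is what a 'json' feed normally produces. The helper name load_items is hypothetical, and the item fields are not documented here, so the example only lists whatever keys it finds.

```python
import json
from pathlib import Path


def load_items(path):
    """Load scraped items from a feed file (assumed to be one JSON array)."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    if not isinstance(items, list):
        raise ValueError(f"Expected a JSON array in {path}")
    return items


feed = Path("items.json")  # the file name used in the quickstart above
if feed.exists():
    items = load_items(feed)
    print(f"Loaded {len(items)} items")
    if items:
        # Field names depend on the spider, so just show what is there.
        print("Fields in first item:", sorted(items[0]))
```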

Please see the example Jupyter notebooks for a more detailed reference.

Contributing

Web scrapers break whenever a website's formatting changes, which happens often. Unfortunately, I can't promise to keep this repository up to date by myself, so I invite you to help when a spider stops working:

  1. Update the XPaths of the broken spider in the finscraper/scrapy_spiders module so that they match the latest version of the website being scraped

  2. Create a pull request
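
Before opening the pull request, it can help to check a candidate XPath against a saved copy of the page. The following is a minimal sketch of that check, not code from this repository. The markup, the class names and both XPath expressions are invented for illustration. It uses xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, whereas the selectors used inside the spiders handle real-world HTML and full XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a saved article page.
SAMPLE = """<html><body>
<article>
<h1 class="headline">Example headline</h1>
<div class="body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
</body></html>"""

# Hypothetical expressions: the old one no longer matches the new layout.
OLD_XPATH = ".//h1[@class='title']"
NEW_XPATH = ".//h1[@class='headline']"


def extract_text(markup, xpath):
    """Return the stripped text of every element matched by xpath."""
    root = ET.fromstring(markup)
    return ["".join(el.itertext()).strip() for el in root.findall(xpath)]


print("old:", extract_text(SAMPLE, OLD_XPATH))
print("new:", extract_text(SAMPLE, NEW_XPATH))
```

An empty result for the old expression and a non-empty one for the new expression suggests the updated XPath matches the current layout.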


Jesse Myrberg (jesse.myrberg@gmail.com)

Download files

Source Distribution

finscraper-0.0.1b0.tar.gz (8.5 kB)
