Web scraping API for Finnish websites
finscraper
This library provides a Python API for downloading data from popular Finnish websites in a structured format.
Websites covered include:
- News articles from IltaSanomat (is.fi)
Installation
pip install -r requirements.txt
TODO: Make a pip package?
Documentation
TODO: Sphinx documentation or similar
Quickstart
The library provides an easy-to-use API for fetching data from various Finnish websites. For example, fetching news articles from IltaSanomat and saving them into items.json can be done as follows:
from finscraper.spiders import ISArticle

# Scrape articles in the 'nhl' category and write them to items.json as JSON
spider = ISArticle(category='nhl',
                   FEEDS={'items.json': {'format': 'json'}})
spider.run()
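Once the spider has finished, the saved file can be inspected with the standard library alone. The snippet below is a minimal sketch, assuming the 'json' feed format writes a single JSON array of objects into items.json. The helper names load_items and summarize are hypothetical and not part of finscraper, and no particular article fields are assumed.

```python
import json
from pathlib import Path


def load_items(path):
    """Read a JSON feed file and return its items as a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        items = json.load(f)
    # Assumption: the feed is one JSON array; fail loudly if it is not.
    if not isinstance(items, list):
        raise ValueError(f"expected a JSON array in {path}")
    return items


def summarize(items):
    """Return the item count and the sorted field names seen across items."""
    fields = sorted({key for item in items for key in item})
    return len(items), fields


if __name__ == "__main__":
    feed = Path("items.json")
    if feed.exists():
        count, fields = summarize(load_items(feed))
        print(f"{count} items with fields: {fields}")
```

Listing the field names first is a cheap way to see what the spider actually extracted before writing any code that depends on specific keys.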
Please see the example Jupyter notebooks for a more detailed reference.
Contributing
Web scrapers break whenever a website's markup changes, which happens often. Unfortunately, I can't promise to keep this repository up to date all by myself, so I invite you to help when a spider stops working:
- Update the XPaths of the broken spider in the finscraper/scrapy_spiders module to match the latest version of the website being scraped
- Create a pull request
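Before opening a pull request, it can help to confirm which XPaths no longer match. The following is a rough sketch using only the standard library, not the way the spiders themselves are implemented. The markup, the class names, and the check_xpaths helper are all hypothetical. xml.etree.ElementTree supports only a limited XPath subset and needs well-formed markup, whereas Scrapy's selectors handle full XPath 1.0 and real-world HTML, so treat this purely as an illustration of the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a saved article page.
SAMPLE = (
    "<html><body><article>"
    '<h1 class="article-title">Example headline</h1>'
    '<p class="body">First paragraph.</p>'
    '<p class="body">Second paragraph.</p>'
    "</article></body></html>"
)

# Hypothetical field-to-XPath mapping of the kind a spider might keep.
XPATHS = {
    "title": ".//h1[@class='article-title']",
    "body": ".//p[@class='body']",
    "author": ".//span[@class='author']",  # no such element in SAMPLE
}


def check_xpaths(markup, xpaths):
    """Return the text matched by each XPath, keyed by field name."""
    root = ET.fromstring(markup)
    return {
        name: ["".join(el.itertext()).strip() for el in root.findall(xp)]
        for name, xp in xpaths.items()
    }


report = check_xpaths(SAMPLE, XPATHS)
# Fields whose XPath matched nothing are the ones that need updating.
broken = [name for name, matches in report.items() if not matches]
print("Broken fields:", broken)
```

A field that comes back empty points to the expression that needs to be rewritten against the current page structure.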
Jesse Myrberg (jesse.myrberg@gmail.com)