A library for scraping the German news archive of Tagesschau.de
Install
Tagesschauscraper is available on PyPI:
$ pip install tagesschauscraper
Usage
Here’s an example of how to use the library to scrape teaser info from the Tagesschau archive:
import logging
import os
from datetime import date

from tagesschauscraper import helper, tagesschau

# Make the logging.info call below visible on the console
logging.basicConfig(level=logging.INFO)

# Scrape teasers published on <date_> in a specific news category
DATA_DIR = "data"
date_ = date(2022, 3, 1)
category = "wirtschaft"

# Initialize the scraper, create the archive URL, and run
tagesschauScraper = tagesschau.TagesschauScraper()
url = tagesschau.create_url_for_news_archive(date_, category=category)
teaser = tagesschauScraper.scrape_teaser(url)

# Save the output in a hierarchical directory tree
if not os.path.isdir(DATA_DIR):
    os.mkdir(DATA_DIR)
dateDirectoryTreeCreator = helper.DateDirectoryTreeCreator(
    date_, root_dir=DATA_DIR
)
file_path = dateDirectoryTreeCreator.create_file_path_from_date()
dateDirectoryTreeCreator.make_dir_tree_from_file_path(file_path)
file_name_and_path = os.path.join(
    file_path,
    helper.create_file_name_from_date(
        date_, suffix="_" + category, extension=".json"
    ),
)
logging.info(f"Save scraped teaser to file {file_name_and_path}")
helper.save_to_json(teaser, file_name_and_path)
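The directory layout can be approximated with the standard library alone. The sketch below is not the library's implementation: `archive_file_path` is a hypothetical helper, and the `<root>/<year>/<month>/<date>_<category>.json` pattern is inferred only from the output path reported for this example (data/2022/03/2022-03-01_wirtschaft.json).

```python
from datetime import date
from pathlib import Path


def archive_file_path(root_dir: str, day: date, category: str) -> Path:
    """Hypothetical helper: build <root>/<YYYY>/<MM>/<YYYY-MM-DD>_<category>.json.

    The pattern is an assumption inferred from the README's example path,
    not taken from the tagesschauscraper source code.
    """
    file_name = f"{day.isoformat()}_{category}.json"
    return Path(root_dir) / f"{day.year:04d}" / f"{day.month:02d}" / file_name


path = archive_file_path("data", date(2022, 3, 1), "wirtschaft")
print(path.as_posix())  # data/2022/03/2022-03-01_wirtschaft.json
```

Such a function can help locate previously saved files for a given day and category without calling the scraper again.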
The result is saved in “data/2022/03/2022-03-01_wirtschaft.json”. The JSON document looks like the following (only a snippet):
{
  "teaser": [
    {
      "date": "2022-03-01 22:23:00",
      "topline": "Deutliche Verluste",
      "headline": "Der Krieg lastet auf der Wall Street",
      "shorttext": "Die intensiven K\u00e4mpfe in der Ukraine und die Auswirkungen der Sanktionen verschreckten die US-Investoren.",
      "link": "https://www.tagesschau.de/wirtschaft/finanzen/marktberichte/marktbericht-dax-dow-jones-213.html",
      "tags": "B\u00f6rse,DAX,Dow Jones,Marktbericht",
      "id": "d49cfb71130e46638dcfe2afe8d775ac9670a9a8"
    },
    {
      "date": "2022-03-01 18:54:00",
      "topline": "Pipeline-Projekt",
      "headline": "Nordstream-Betreiber offenbar insolvent",
      "shorttext": "Die Nord Stream 2 AG, die Schweizer Eigent\u00fcmergesellschaft der neuen Ostsee-Pipeline nach Russland, ist offenbar insolvent.",
      "link": "https://www.tagesschau.de/wirtschaft/unternehmen/nord-stream-insolvenz-gazrom-gas-pipeline-russland-ukraine-103.html",
      "tags": "Insolvenz,Nord Stream 2,Pipeline,Russland,Schweiz",
      "id": "595aa643ed39edd3695b8401a99ce808afa539fb"
    },
    {
      "date": "2022-03-01 18:52:00",
      "topline": "Fehlende Teile wegen Ukraine-Kriegs",
      "headline": "Autobauern drohen Produktionsausf\u00e4lle",
      "shorttext": "Der anhaltende Krieg in der Ukraine bremst auch die deutsche Autoindustrie.",
      "link": "https://www.tagesschau.de/wirtschaft/autobauern-drohen-produktionsausfaelle-101.html",
      "tags": "Autowerke,BMW,Mercedes,Produktionsausf\u00e4lle,Ukraine,Ukraine-Krieg,VW",
      "id": "914174596c3590784c903908f569c099475981f7"
    },
    ...
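A saved document of this shape can be post-processed with the standard library. The following is a minimal sketch, assuming every entry follows the snippet above: `date` in `YYYY-MM-DD HH:MM:SS` form and `tags` as a single comma-separated string. `parse_teasers` is a hypothetical helper, not part of tagesschauscraper, and the inline string stands in for a file that a real run would read with `json.load`.

```python
import json
from datetime import datetime

# Shortened excerpt of the snippet above (only three fields per teaser).
raw = r'''
{"teaser": [
  {"date": "2022-03-01 22:23:00",
   "headline": "Der Krieg lastet auf der Wall Street",
   "tags": "B\u00f6rse,DAX,Dow Jones,Marktbericht"},
  {"date": "2022-03-01 18:54:00",
   "headline": "Nordstream-Betreiber offenbar insolvent",
   "tags": "Insolvenz,Nord Stream 2,Pipeline,Russland,Schweiz"}
]}
'''


def parse_teasers(document: dict) -> list:
    """Hypothetical helper: convert 'date' to datetime and 'tags' to a list."""
    parsed = []
    for item in document["teaser"]:
        parsed.append({
            **item,
            # Assumed timestamp format, inferred from the snippet only
            "date": datetime.strptime(item["date"], "%Y-%m-%d %H:%M:%S"),
            # Assumed comma-separated tags; an empty string yields no tags
            "tags": item["tags"].split(",") if item["tags"] else [],
        })
    return parsed


teasers = parse_teasers(json.loads(raw))
for t in teasers:
    print(t["date"].isoformat(), t["headline"], t["tags"])
```

Note that `json.loads` decodes escapes such as `\u00f6`, so the first tag of the first teaser becomes `Börse`.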
Contributing
If you’d like to contribute to TagesschauScraper, please fork the repository and make changes as you’d like. Pull requests are welcome.
License
TagesschauScraper is licensed under the GPL-3.0 license.
Disclaimer
Please note that this is a scraping tool, and using it to scrape website data without the website owner’s consent may be against their terms of service. Use at your own risk.
File details
Details for the file tagesschauscraper-0.1.2.tar.gz.
File metadata
- Download URL: tagesschauscraper-0.1.2.tar.gz
- Upload date:
- Size: 21.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 84fd15e485b31c5eee9b10baad949ca229b2dcaf43fbafc569bfd5b33ffb4eb8
MD5 | 71ff9f1e565253fea85e21b83ded91a1
BLAKE2b-256 | 5bf670e23288d029220c8a4f53c021973734aca2b7201814c3136d56476625ff
File details
Details for the file tagesschauscraper-0.1.2-py3-none-any.whl.
File metadata
- Download URL: tagesschauscraper-0.1.2-py3-none-any.whl
- Upload date:
- Size: 20.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | cc22839be1f1a3904ff3495e2eecbf2cd25e28573f39519d5c0bdd8a1f9ae612
MD5 | 93e8f459bc42fb634089a1610618134c
BLAKE2b-256 | 685de1291265682cc2e8634947afa695af3a05ebd69dcca1d8cccb3fc4e21ceb