A library for scraping the German news archive of Tagesschau.de

Project description

Install

Tagesschauscraper is available on PyPI:

$ pip install tagesschauscraper

Usage

Here’s an example of how to use the library to scrape teaser info from the Tagesschau archive:

import logging
import os
from datetime import date
from tagesschauscraper import constants, helper, tagesschau

# Scrape teasers published on <date_> in a specific news category
DATA_DIR = "data"
date_ = date(2022, 3, 1)
category = "wirtschaft"

# Initialize scraper, create url and run
tagesschauScraper = tagesschau.TagesschauScraper()
url = tagesschau.create_url_for_news_archive(date_, category=category)
teaser = tagesschauScraper.scrape_teaser(url)

# Save output in a hierarchical directory tree
os.makedirs(DATA_DIR, exist_ok=True)
dateDirectoryTreeCreator = helper.DateDirectoryTreeCreator(
    date_, root_dir=DATA_DIR
)
file_path = dateDirectoryTreeCreator.create_file_path_from_date()
dateDirectoryTreeCreator.make_dir_tree_from_file_path(file_path)
file_name_and_path = os.path.join(
    file_path,
    helper.create_file_name_from_date(
        date_, suffix="_" + category, extension=".json"
    ),
)
logging.info(f"Save scraped teaser to file {file_name_and_path}")
helper.save_to_json(teaser, file_name_and_path)
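
The directory layout produced by the helper calls above can be pictured with a short standard-library sketch. It assumes a `<root>/<year>/<month>/<date>_<category>.json` layout; `build_teaser_path` is a hypothetical name used only for illustration and is not part of tagesschauscraper.

```python
from datetime import date
from pathlib import Path


def build_teaser_path(day: date, category: str, root_dir: str = "data") -> Path:
    """Hypothetical helper: build <root>/<YYYY>/<MM>/<YYYY-MM-DD>_<category>.json."""
    # Assumed layout: one directory level per year, one per zero-padded month.
    directory = Path(root_dir) / f"{day.year:04d}" / f"{day.month:02d}"
    return directory / f"{day.isoformat()}_{category}.json"


path = build_teaser_path(date(2022, 3, 1), "wirtschaft")
print(path.as_posix())  # data/2022/03/2022-03-01_wirtschaft.json
```

This only computes the path; the library helpers shown above also create the directories on disk.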

The scraped teasers are saved to “data/2022/03/2022-03-01_wirtschaft.json”. The JSON document looks like this (only a snippet):

{
    "teaser": [
        {
            "date": "2022-03-01 22:23:00",
            "topline": "Deutliche Verluste",
            "headline": "Der Krieg lastet auf der Wall Street",
            "shorttext": "Die intensiven K\u00e4mpfe in der Ukraine und die Auswirkungen der Sanktionen verschreckten die US-Investoren.",
            "link": "https://www.tagesschau.de/wirtschaft/finanzen/marktberichte/marktbericht-dax-dow-jones-213.html",
            "tags": "B\u00f6rse,DAX,Dow Jones,Marktbericht",
            "id": "d49cfb71130e46638dcfe2afe8d775ac9670a9a8"
        },
        {
            "date": "2022-03-01 18:54:00",
            "topline": "Pipeline-Projekt",
            "headline": "Nordstream-Betreiber offenbar insolvent",
            "shorttext": "Die Nord Stream 2 AG, die Schweizer Eigent\u00fcmergesellschaft der neuen Ostsee-Pipeline nach Russland, ist offenbar insolvent.",
            "link": "https://www.tagesschau.de/wirtschaft/unternehmen/nord-stream-insolvenz-gazrom-gas-pipeline-russland-ukraine-103.html",
            "tags": "Insolvenz,Nord Stream 2,Pipeline,Russland,Schweiz",
            "id": "595aa643ed39edd3695b8401a99ce808afa539fb"
        },
        {
            "date": "2022-03-01 18:52:00",
            "topline": "Fehlende Teile wegen Ukraine-Kriegs",
            "headline": "Autobauern drohen Produktionsausf\u00e4lle",
            "shorttext": "Der anhaltende Krieg in der Ukraine bremst auch die deutsche Autoindustrie.",
            "link": "https://www.tagesschau.de/wirtschaft/autobauern-drohen-produktionsausfaelle-101.html",
            "tags": "Autowerke,BMW,Mercedes,Produktionsausf\u00e4lle,Ukraine,Ukraine-Krieg,VW",
            "id": "914174596c3590784c903908f569c099475981f7"
        },
        ...
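
To work with the saved output, the teaser list can be read back with the standard library. The sketch below assumes the field names and formats visible in the snippet above (a "teaser" list, "date" as `YYYY-MM-DD HH:MM:SS`, "tags" as a comma-separated string). It parses an inline sample rather than the real file, and `parse_teasers` is a hypothetical helper, not part of tagesschauscraper.

```python
import json
from datetime import datetime

# Inline sample with the same shape as the snippet above (some fields omitted).
SAMPLE = r"""
{
    "teaser": [
        {
            "date": "2022-03-01 22:23:00",
            "headline": "Der Krieg lastet auf der Wall Street",
            "tags": "B\u00f6rse,DAX,Dow Jones,Marktbericht"
        },
        {
            "date": "2022-03-01 18:54:00",
            "headline": "Nordstream-Betreiber offenbar insolvent",
            "tags": "Insolvenz,Nord Stream 2,Pipeline,Russland,Schweiz"
        }
    ]
}
"""


def parse_teasers(raw_json: str) -> list:
    """Hypothetical helper: parse each teaser's date and split its tags."""
    teasers = []
    for item in json.loads(raw_json)["teaser"]:
        teasers.append(
            {
                # Assumed timestamp format, taken from the snippet above.
                "published": datetime.strptime(item["date"], "%Y-%m-%d %H:%M:%S"),
                "headline": item["headline"],
                "tags": item["tags"].split(","),
            }
        )
    return teasers


teasers = parse_teasers(SAMPLE)
for teaser in teasers:
    print(teaser["published"].isoformat(), teaser["headline"], teaser["tags"])
```

For a real file, the same function could be fed the file contents, for example `parse_teasers(open(file_name_and_path, encoding="utf-8").read())`.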

Contributing

To contribute to TagesschauScraper, fork the repository and make your changes. Pull requests are welcome.

License

TagesschauScraper is licensed under the GPL-3.0 license.

Disclaimer

Please note that this is a scraping tool, and using it to scrape website data without the website owner’s consent may be against their terms of service. Use at your own risk.

Download files

Source Distribution

tagesschauscraper-0.1.2.tar.gz (21.8 kB)

Uploaded Source

Built Distribution

tagesschauscraper-0.1.2-py3-none-any.whl (20.9 kB)

Uploaded Python 3

File details

Details for the file tagesschauscraper-0.1.2.tar.gz.

File metadata

  • Download URL: tagesschauscraper-0.1.2.tar.gz
  • Size: 21.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.5

File hashes

Hashes for tagesschauscraper-0.1.2.tar.gz
Algorithm Hash digest
SHA256 84fd15e485b31c5eee9b10baad949ca229b2dcaf43fbafc569bfd5b33ffb4eb8
MD5 71ff9f1e565253fea85e21b83ded91a1
BLAKE2b-256 5bf670e23288d029220c8a4f53c021973734aca2b7201814c3136d56476625ff

File details

Details for the file tagesschauscraper-0.1.2-py3-none-any.whl.

File hashes

Hashes for tagesschauscraper-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 cc22839be1f1a3904ff3495e2eecbf2cd25e28573f39519d5c0bdd8a1f9ae612
MD5 93e8f459bc42fb634089a1610618134c
BLAKE2b-256 685de1291265682cc2e8634947afa695af3a05ebd69dcca1d8cccb3fc4e21ceb
