Script that scrapes news articles in the 2022 Semeval Task 8 format from the Internet Archive

Project description

semeval 2022 task 8 downloader

Script that scrapes news articles in the 2022 Semeval Task 8 format from the Internet Archive.

Details about the data and the task are available on the project homepage.

A pair of articles with id 0123456789_9876543210 is stored in output_dir/89/0123456789.{html|json} and output_dir/10/9876543210.{html|json}, respectively. In this example, each subdirectory name matches the last two digits of the article id.
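
The layout in the example above can be sketched in Python as follows. This is a minimal sketch, not the package's own code: the helper name `article_paths` is hypothetical, and it assumes the subdirectory is always the last two digits of the article id, which is what the example suggests but the description does not state outright.

```python
from pathlib import Path


def article_paths(pair_id, output_dir="output_dir"):
    """Return the expected HTML/JSON paths for both articles of a pair.

    Assumption (inferred from the documented example, not from the
    package's code): each article lives in a subdirectory named after
    the last two digits of its id.
    """
    paths = {}
    for article_id in pair_id.split("_"):
        subdir = Path(output_dir) / article_id[-2:]
        paths[article_id] = {
            "html": subdir / f"{article_id}.html",
            "json": subdir / f"{article_id}.json",
        }
    return paths


if __name__ == "__main__":
    # Print the expected locations for the pair used in the example.
    for article_id, files in article_paths("0123456789_9876543210").items():
        print(article_id, files["html"], files["json"])
```

A helper like this can be used to check which pairs from the input file were actually downloaded, by testing whether the returned paths exist.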

The HTML file contains the web page of the article as obtained from the Internet Archive. The JSON file contains additional information extracted from the page using the newspaper3k package.
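
To see what the JSON file holds for a given article, it can be loaded with the standard library. The sketch below is only illustrative: the function names `load_article` and `describe_article` are hypothetical, the UTF-8 encoding is an assumption, and no field names are assumed because the description does not list them.

```python
import json
from pathlib import Path


def load_article(json_path):
    """Load the metadata file stored next to an article's HTML page.

    Assumption: the file is UTF-8 encoded JSON.
    """
    with Path(json_path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe_article(json_path):
    """Return the sorted top-level keys of the JSON file.

    Returns an empty list if the file is missing or does not contain
    a JSON object, so it is safe to call on incomplete downloads.
    """
    path = Path(json_path)
    if not path.exists():
        return []
    data = load_article(path)
    return sorted(data) if isinstance(data, dict) else []


if __name__ == "__main__":
    # Path taken from the example pair in the description.
    print(describe_article("output_dir/89/0123456789.json"))
```

Listing the keys first is a cautious way to discover which fields are available before relying on any of them.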

The code is available on GitHub, together with sample input data (sample_data.csv).

Usage

python3 -m venv venv
source venv/bin/activate
pip install semeval_8_2022_ia_downloader
python -m semeval_8_2022_ia_downloader.cli --links_file=input.csv --dump_dir=output_dir

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.1.0 (2021-07-30)

  • First release on PyPI.

Download files

Source Distribution: semeval_8_2022_ia_downloader-0.1.7.tar.gz (17.4 kB)

Built Distribution: semeval_8_2022_ia_downloader-0.1.7-py2.py3-none-any.whl (14.2 kB, Python 2 and Python 3)
