Script that scrapes news articles in the 2022 Semeval Task 8 format from the Internet Archive

Project description

semeval 2022 task 8 downloader

Script that scrapes news articles in the 2022 Semeval Task 8 format from the Internet Archive.

Details about the data and the task are available on the project homepage.

A pair of articles with id 0123456789_9876543210 will be stored in output_dir/89/0123456789.{html|json} and output_dir/10/9876543210.{html|json} respectively.
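The layout above can be sketched in a few lines of Python. This is only an illustration: the helper name `article_paths` is hypothetical and not part of the package, and the rule that the subdirectory is the last two characters of the article id is inferred from the single example above rather than confirmed by the package source.

```python
from pathlib import Path


def article_paths(pair_id, output_dir="output_dir"):
    """Map each article id in a pair id such as 'A_B' to its (html, json) paths.

    Assumption: the subdirectory name is the last two characters of the
    article id, as suggested by the example in the description.
    """
    root = Path(output_dir)
    paths = {}
    for article_id in pair_id.split("_"):
        subdir = root / article_id[-2:]
        paths[article_id] = (
            subdir / f"{article_id}.html",
            subdir / f"{article_id}.json",
        )
    return paths
```

Calling `article_paths("0123456789_9876543210")` yields paths under `output_dir/89/` for the first article and `output_dir/10/` for the second.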

The HTML file contains the web page of the article as retrieved from the Internet Archive. The JSON file contains additional information extracted from the page with the newspaper3k package.
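A minimal sketch of reading one downloaded article back from disk might look like the following. The function name `load_article` is hypothetical, the two-character subdirectory rule is inferred from the example path above, and no assumption is made about which keys the JSON file contains.

```python
import json
from pathlib import Path


def load_article(article_id, output_dir="output_dir"):
    """Return (html_text, metadata) for one downloaded article.

    Assumption: files live in output_dir/<last two characters of id>/.
    """
    subdir = Path(output_dir) / article_id[-2:]
    # Archived pages may not be valid UTF-8, so undecodable bytes are replaced.
    html_text = (subdir / f"{article_id}.html").read_text(
        encoding="utf-8", errors="replace"
    )
    with open(subdir / f"{article_id}.json", encoding="utf-8") as fh:
        metadata = json.load(fh)
    return html_text, metadata
```

Inspecting `metadata.keys()` on a real file is the safest way to see which fields were extracted.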

The code is available on GitHub, together with sample input data (sample_data.csv).

Usage

python3 -m venv venv
source venv/bin/activate
pip install semeval_8_2022_ia_downloader
python -m semeval_8_2022_ia_downloader.cli --links_file=input.csv --dump_dir=output_dir

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.1.0 (2021-07-30)

  • First release on PyPI.

Download files

Source Distribution

semeval_8_2022_ia_downloader-0.1.7.tar.gz (17.4 kB, Source)

Built Distribution

semeval_8_2022_ia_downloader-0.1.7-py2.py3-none-any.whl (14.2 kB, Python 2 and Python 3)

File details

Details for the file semeval_8_2022_ia_downloader-0.1.7.tar.gz.

File metadata

  • Download URL: semeval_8_2022_ia_downloader-0.1.7.tar.gz
  • Size: 17.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.7.1 requests/2.26.0 setuptools/57.4.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.6

File hashes

Hashes for semeval_8_2022_ia_downloader-0.1.7.tar.gz
  • SHA256: 934b78fbb3990e1e0c129f061a2720915482f0b716efb397fc6e3c1c89aa0939
  • MD5: 5925cc365e49179d25ca3a7f4e51a94c
  • BLAKE2b-256: 479e3f4be18d5304f7e8b23535203977ba193d298e1dbde51f30e89c986120f0
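A downloaded archive can be checked against the SHA256 digest listed above with the standard library. The following is a sketch only; the helper names `sha256_of` and `matches_digest` are hypothetical, and the file is assumed to be in the current directory.

```python
import hashlib

# SHA256 digest listed for semeval_8_2022_ia_downloader-0.1.7.tar.gz
EXPECTED_SHA256 = "934b78fbb3990e1e0c129f061a2720915482f0b716efb397fc6e3c1c89aa0939"


def sha256_of(path, chunk_size=1 << 16):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True if the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.lower()
```

For example, `matches_digest("semeval_8_2022_ia_downloader-0.1.7.tar.gz", EXPECTED_SHA256)` should return True for an unmodified download.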

File details

Details for the file semeval_8_2022_ia_downloader-0.1.7-py2.py3-none-any.whl.

File metadata

  • Download URL: semeval_8_2022_ia_downloader-0.1.7-py2.py3-none-any.whl
  • Size: 14.2 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.7.1 requests/2.26.0 setuptools/57.4.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.6

File hashes

Hashes for semeval_8_2022_ia_downloader-0.1.7-py2.py3-none-any.whl
  • SHA256: e8e24f12f2a78f376be4ffe8f7047810894914a34c6249e3c4b644e4a9a68f97
  • MD5: 6f57b1618c6808261023400379c859ee
  • BLAKE2b-256: 39affca66de973c4a00459eed609cc7ea71c11a5c0c3431b47bc7097fa6a07df
