semeval 2022 task 8 downloader
Script that scrapes news articles in the SemEval 2022 Task 8 format from the Internet Archive.
Details about the data and the task are available on the project homepage.
A pair of articles with id 0123456789_9876543210 is stored in output_dir/89/0123456789.{html|json} and output_dir/10/9876543210.{html|json}, respectively. In this example, each subdirectory name matches the last two digits of the article id.
The HTML file contains the web page of the article as retrieved from the Internet Archive. The JSON file contains additional information extracted from the page with the newspaper3k package.
The code is available on GitHub, together with sample input data (sample_data.csv).
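The storage layout described above can be sketched as a small helper. This is an illustration, not the package's own code: the function name `article_paths` is hypothetical, and the rule that the subdirectory is the last two digits of the article id is an assumption inferred from the single example given here.

```python
from pathlib import Path


def article_paths(pair_id, output_dir="output_dir"):
    """Return {article_id: (html_path, json_path)} for a pair id such as
    "0123456789_9876543210".

    Assumption: the subdirectory is the last two characters of the
    article id, as suggested by the example in this README.
    """
    paths = {}
    for article_id in pair_id.split("_"):
        subdir = Path(output_dir) / article_id[-2:]
        paths[article_id] = (
            subdir / f"{article_id}.html",
            subdir / f"{article_id}.json",
        )
    return paths
```

Calling `article_paths("0123456789_9876543210")` gives one HTML path and one JSON path for each of the two article ids.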
Usage
python3 -m venv venv
source venv/bin/activate
pip install semeval_8_2022_ia_downloader
python -m semeval_8_2022_ia_downloader.cli --links_file=input.csv --dump_dir=output_dir
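After a run, it can be useful to check which articles were actually downloaded. The following is a minimal sketch under the assumption that the dump directory follows the `<two-digit subdirectory>/<article id>.json` layout described in this README. The helper names `load_downloaded` and `pair_is_complete` are hypothetical and not part of the package, and no particular JSON field names are assumed.

```python
import json
from pathlib import Path


def load_downloaded(dump_dir):
    """Map article id -> parsed JSON for every *.json file found one
    directory level below dump_dir."""
    articles = {}
    for json_path in sorted(Path(dump_dir).glob("*/*.json")):
        with json_path.open(encoding="utf-8") as fh:
            articles[json_path.stem] = json.load(fh)
    return articles


def pair_is_complete(pair_id, articles):
    """True if both article ids of a pair id like "a_b" were loaded."""
    return all(article_id in articles for article_id in pair_id.split("_"))
```

For example, `pair_is_complete("0123456789_9876543210", load_downloaded("output_dir"))` reports whether both articles of that pair have a JSON file on disk.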
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.1.0 (2021-07-30)
First release on PyPI.
Hashes for semeval_8_2022_ia_downloader-0.1.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | d41f15da7acae1fb8901b1cae29d90134678078f991028d8ed3341cdced8cad5
MD5 | a518840ab7afa540955d0081a8536322
BLAKE2b-256 | 645ac1e64c91f5bc8111655151f55f1418778374fd1eb678c7f473fceaa1f961
Hashes for semeval_8_2022_ia_downloader-0.1.0-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | fb41694b81fc9d42db94a24e7d008abe47da11260acc3eafed039aae85fc4ed4
MD5 | ab077ebd64d93ea9895ec274d10b75f9
BLAKE2b-256 | b229c6ed72ac1a5bffd61c15687d185e442d374d85a243f4540e91d96b7cecc0