Newspaper3kli
Newspaper3kli stands for the "kommand-line" interface over Newspaper3k.
A tiny layer on top of Newspaper3k with support for Unix-like executions and parallelism (using asyncio) to download bulks of articles faster.
Requirements
In addition to the requirements, make sure you have nltk's
punkt package installed (via nltk.download('punkt') in
interactive Python) for Newspaper3k's article.nlp() to work
properly.
Installation
# assuming your OS has pip3 as default
pip install newspaper3kli==0.1.0
Usage
Overview of available parameters
usage: newspaper3kli [-h] [--url URL] [-r] [-o OUTPUT] [-u] [-m MAX_RETRIES]
                     [-b BACKOFF]
                     [urls [urls ...]]

positional arguments:
  urls                  URL to download content from (single download)

optional arguments:
  -h, --help            show this help message and exit
  --url URL             Enter the URLs to download content from.
  -r, --redirects       Flag to enable follow redirects in web pages.
  -o OUTPUT, --output OUTPUT
                        Output path to store the results
  -u, --unverified      Select to allow unverified SSL certificates.
  -m MAX_RETRIES, --max_retries MAX_RETRIES
                        Set the max number of retries (default 0 to fail on
                        first retry).
  -b BACKOFF, --backoff BACKOFF
                        Set the backoff factor (default 0).
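For readers who want to see how these options fit together, the help text above could be produced by an argparse parser roughly like the one below. This is a reconstruction from the usage output only, not the project's actual source: the function name build_parser, the int and float types for --max_retries and --backoff, and the store_true actions for the flags are all assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser rebuilt from the help text (not the real source)."""
    parser = argparse.ArgumentParser(prog="newspaper3kli")
    parser.add_argument("urls", nargs="*",
                        help="URL to download content from (single download)")
    parser.add_argument("--url",
                        help="Enter the URLs to download content from.")
    parser.add_argument("-r", "--redirects", action="store_true",
                        help="Flag to enable follow redirects in web pages.")
    parser.add_argument("-o", "--output",
                        help="Output path to store the results")
    parser.add_argument("-u", "--unverified", action="store_true",
                        help="Select to allow unverified SSL certificates.")
    # Assumed types: an integer retry count and a float backoff factor.
    parser.add_argument("-m", "--max_retries", type=int, default=0,
                        help="Set the max number of retries (default 0 to "
                             "fail on first retry).")
    parser.add_argument("-b", "--backoff", type=float, default=0,
                        help="Set the backoff factor (default 0).")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Under these assumptions, every option is optional and any number of positional URLs is accepted, which matches the [urls [urls ...]] part of the usage line.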
Executing
Passing URLs from the terminal
newspaper3kli https://hello.world/article/2020 \
https://hello.world/article/2019
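When several URLs are passed at once, the asyncio-based parallelism and the --max_retries and --backoff options come into play. The sketch below shows one way concurrent downloads with retries can be arranged. It is not Newspaper3kli's implementation: fetch is a stand-in coroutine, make_flaky_fetch only simulates failures, and the exponential delay (backoff * 2**attempt seconds) is an assumed schedule that may differ from what the tool does.

```python
import asyncio


async def fetch_with_retries(fetch, url, max_retries=0, backoff=0.0):
    """Await the hypothetical coroutine fetch(url), retrying on failure."""
    attempt = 0
    while True:
        try:
            return await fetch(url)
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: propagate the last error
            # Assumed schedule: wait backoff * 2**attempt seconds, then retry.
            await asyncio.sleep(backoff * (2 ** attempt))
            attempt += 1


async def download_all(fetch, urls, max_retries=0, backoff=0.0):
    """Download all URLs concurrently; failures are returned, not raised."""
    tasks = [fetch_with_retries(fetch, u, max_retries, backoff) for u in urls]
    return await asyncio.gather(*tasks, return_exceptions=True)


def make_flaky_fetch(failures_per_url):
    """Build a fake fetch that fails a fixed number of times per URL."""
    seen = {}

    async def fetch(url):
        seen[url] = seen.get(url, 0) + 1
        if seen[url] <= failures_per_url:
            raise ConnectionError(f"simulated failure for {url}")
        return f"<html>{url}</html>"

    return fetch


if __name__ == "__main__":
    urls = ["https://hello.world/article/2020",
            "https://hello.world/article/2019"]
    print(asyncio.run(download_all(make_flaky_fetch(1), urls, max_retries=1)))
```

With max_retries left at 0, the first failure for a URL is final, which is consistent with the default described in the help text.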
Reading from a txt file
TXT is the simplest file format for reading with Newspaper3kli.
Assuming the txt file has the following content (line-delimited URLs):
https://hello.world/article/2020
https://hello.world/article/2019
cat /path/to/this/file.txt | newspaper3kli
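The pipe above works because the URLs arrive on standard input, one per line. A minimal sketch of how a command-line tool might merge positional URLs with piped ones follows; collect_urls is a hypothetical helper written for illustration and is not taken from Newspaper3kli's source.

```python
import sys


def collect_urls(positional, stream=None):
    """Merge URLs given as arguments with line-delimited URLs from a stream."""
    urls = list(positional)
    if stream is not None:
        for line in stream:
            line = line.strip()
            if line:  # skip blank lines
                urls.append(line)
    return urls


if __name__ == "__main__":
    # Only read stdin when something is piped in, so an interactive
    # terminal does not block waiting for input.
    piped = None if sys.stdin.isatty() else sys.stdin
    print(collect_urls(sys.argv[1:], piped))
```

Checking isatty() before reading is a common design choice for tools that accept both arguments and piped input.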
Reading from a CSV file
CSV parsing depends on a tool such as awk or cut to split the columns.
Content sample
url,tags,date
https://hello.world/article/2020,some|thing,2020-01-01T00:00:00
https://hello.world/article/2019,some|thing,2019-01-01T00:00:00
Processing
# $1 is the column holding the URLs (change it to match your file); NR>1 skips the header row
cat /path/to/this/file.csv | awk -F, 'NR>1{ print $1 }' | newspaper3kli
For any other character-delimited content, simply change -F, (comma) to the desired delimiter, e.g. -F'\t' for TSV.
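If awk is not available, or the file contains quoted fields that a plain comma split would break, the same column extraction can be sketched with Python's csv module. The helper name extract_column and the script name extract_urls.py are hypothetical; the column name url comes from the sample above.

```python
import csv
import sys


def extract_column(lines, column="url", delimiter=","):
    """Yield the values of one named column; the header row is consumed."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:  # ignore rows where the column is empty or missing
            yield value


if __name__ == "__main__":
    # Usage: python extract_urls.py < /path/to/this/file.csv | newspaper3kli
    for url in extract_column(sys.stdin):
        print(url)
```

Passing delimiter="\t" handles TSV input in the same way.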
Output path
When no path is given through the --output parameter, results are written to a
directory named output inside Newspaper3kli's directory. Files are named after
the article and are stored in pairs:
- JSON for metadata;
- HTML for content.
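To make the paired layout concrete, here is a rough sketch of writing one JSON file and one HTML file per article. It is illustrative only: save_article, the slug-style naming rule, and the metadata fields are assumptions, and the way Newspaper3kli actually derives file names is not documented here.

```python
import json
import re
from pathlib import Path


def save_article(title, html, metadata, output_dir="output"):
    """Write <name>.json (metadata) and <name>.html (content); return both paths."""
    # Assumed naming rule: lowercase title, non-alphanumerics collapsed to "-".
    name = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "article"
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{name}.json"
    html_path = out / f"{name}.html"
    json_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    html_path.write_text(html, encoding="utf-8")
    return json_path, html_path
```

Keeping both files under the same base name makes it easy to match an article's metadata with its content later.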
Credits
Thanks to dsynkov for the work on newspaper-bulk, the source of inspiration and of some code for this project.