Newspaper3kli
Newspaper3kli stands for the "kommand-line" interface over Newspaper3k.
A tiny layer on top of Newspaper3k with support for Unix-like executions and parallelism (using asyncio) to download articles in bulk faster.
Requirements
In addition to the package requirements, make sure you have nltk's punkt package installed (via nltk.download() in interactive Python) for Newspaper3k's article.nlp() to work properly.
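As a rough sketch of the step above, the punkt data can also be fetched non-interactively. The helper name ensure_punkt is hypothetical and not part of Newspaper3kli; it only uses nltk.data.find and nltk.download, and it assumes nltk is already installed and that the machine has network access when the download is needed.

```python
def ensure_punkt():
    """Download nltk's punkt tokenizer data only if it is not already present."""
    # Imported lazily so this helper can be defined even before nltk is installed.
    import nltk

    try:
        # Raises LookupError when the punkt data is missing locally.
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")
```

Calling ensure_punkt() once before the first run should be enough; later calls return without downloading anything.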
Installation
# assuming your OS has pip3 as default
pip install newspaper3kli==0.1.0
Usage
Overview of available parameters
usage: newspaper3kli [-h] [--url URL] [-r] [-o OUTPUT] [-u] [-m MAX_RETRIES]
[-b BACKOFF]
[urls [urls ...]]
positional arguments:
urls URL to download content from (single download)
optional arguments:
-h, --help show this help message and exit
--url URL Enter the URLs to download content from.
-r, --redirects Flag to enable follow redirects in web pages.
-o OUTPUT, --output OUTPUT
Output path to store the results
-u, --unverified Select to allow unverified SSL certificates.
-m MAX_RETRIES, --max_retries MAX_RETRIES
Set the max number of retries (default 0 to fail on
first retry).
-b BACKOFF, --backoff BACKOFF
Set the backoff factor (default 0).
Executing
Passing URLs from the terminal
newspaper3kli https://hello.world/article/2020 \
https://hello.world/article/2019
Reading from a txt file
TXT is the simplest file format for reading with Newspaper3kli.
Assuming the txt file has the following content (line-delimited URLs):
https://hello.world/article/2020
https://hello.world/article/2019
cat /path/to/this/file.txt | newspaper3kli
Reading from a CSV file
CSV parsing depends on a tool like awk or cut to split the columns.
Content sample
url,tags,date
https://hello.world/article/2020,some|thing,2020-01-01T00:00:00
https://hello.world/article/2019,some|thing,2019-01-01T00:00:00
Processing
# $1 is the URL column number (change it to match your file); NR>1 skips the header row
cat /path/to/this/file.csv | awk -F, 'NR>1 { print $1 }' | newspaper3kli
For any other character-delimited content, simply change -F, (comma) to the desired delimiter, e.g. -F'\t' for TSV.
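Splitting on a bare comma breaks when a field contains a quoted comma. As an alternative, here is a minimal sketch using Python's csv module to pull the URL column. The function name extract_urls and the column name url (taken from the content sample) are assumptions for illustration, not part of Newspaper3kli.

```python
import csv
import io


def extract_urls(stream, column="url"):
    """Yield the non-empty values of `column` from a CSV stream with a header row."""
    reader = csv.DictReader(stream)
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            yield value


# Demonstration on the content sample from this section.
sample = io.StringIO(
    "url,tags,date\n"
    "https://hello.world/article/2020,some|thing,2020-01-01T00:00:00\n"
    "https://hello.world/article/2019,some|thing,2019-01-01T00:00:00\n"
)
urls = list(extract_urls(sample))
print("\n".join(urls))
```

Reading from sys.stdin instead of the sample and printing one URL per line would let the result be piped into newspaper3kli in the same way as the awk command.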
Output path
When no path is specified through the --output parameter, the default path is output inside Newspaper3kli's directory. Files are named after the article and are stored in pairs:
- JSON for metadata;
- HTML for content.
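To work with the stored pairs afterwards, something like the following sketch could be used. It assumes that both files of a pair share the same base name and use the .json and .html extensions; the README does not spell out the exact naming or the JSON layout, so load_pairs is a hypothetical helper and the metadata keys in the demonstration are made up.

```python
import json
import tempfile
from pathlib import Path


def load_pairs(output_dir="output"):
    """Map each base name to (metadata dict, HTML text) for complete pairs only."""
    pairs = {}
    for json_path in sorted(Path(output_dir).glob("*.json")):
        html_path = json_path.with_suffix(".html")
        if not html_path.exists():
            continue  # skip metadata files that have no matching HTML file
        metadata = json.loads(json_path.read_text(encoding="utf-8"))
        pairs[json_path.stem] = (metadata, html_path.read_text(encoding="utf-8"))
    return pairs


# Demonstration with a throwaway directory and invented file contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "some-article.json").write_text('{"title": "Some article"}', encoding="utf-8")
    Path(tmp, "some-article.html").write_text("<p>Hello</p>", encoding="utf-8")
    Path(tmp, "orphan.json").write_text("{}", encoding="utf-8")
    pairs = load_pairs(tmp)

print(sorted(pairs))
```

Metadata files without a matching HTML file are skipped here, which is a design choice of this sketch rather than documented behaviour of the tool.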
Credits
Thanks to dsynkov for the work on newspaper-bulk, the source of inspiration and of some code for this project.