A tiny layer on top of Newspaper3k that adds Unix-style execution (URLs as arguments or piped through stdin) and parallelism (using asyncio) to download articles in bulk faster.

Project description

Newspaper3kli

Newspaper3kli stands for the "kommand-line" interface over Newspaper3k.

Requirements

In addition to the package requirements, make sure nltk's punkt package is installed (for example via nltk.download('punkt') in interactive Python) so that Newspaper3k's article.nlp() works properly.

Installation

# assuming your OS has pip3 as default
pip install newspaper3kli==0.1.0

Usage

Overview of available parameters

usage: newspaper3kli [-h] [--url URL] [-r] [-o OUTPUT] [-u] [-m MAX_RETRIES]
                     [-b BACKOFF]
                     [urls [urls ...]]

positional arguments:
  urls                  URL to download content from (single download)

optional arguments:
  -h, --help            show this help message and exit
  --url URL             Enter the URLs to download content from.
  -r, --redirects       Flag to enable follow redirects in web pages.
  -o OUTPUT, --output OUTPUT
                        Output path to store the results
  -u, --unverified      Select to allow unverified SSL certificates.
  -m MAX_RETRIES, --max_retries MAX_RETRIES
                        Set the max number of retries (default 0 to fail on
                        first retry).
  -b BACKOFF, --backoff BACKOFF
                        Set the backoff factor (default 0).
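
The help text does not say how the backoff factor turns into a wait time between retries. The sketch below assumes a common exponential scheme, where the wait doubles on every retry and the factor scales it; the function name backoff_delays is hypothetical and is not part of Newspaper3kli, and the tool's real behaviour may differ.

```python
from typing import List


def backoff_delays(max_retries: int, backoff: float) -> List[float]:
    """Return the assumed wait (in seconds) before each retry.

    Assumption: delay = backoff * 2 ** attempt, with attempt starting at 0.
    """
    return [backoff * (2 ** attempt) for attempt in range(max_retries)]


# With -m 3 -b 0.5 the assumed waits grow exponentially.
print(backoff_delays(3, 0.5))  # [0.5, 1.0, 2.0]

# With the default backoff of 0 every retry would happen immediately.
print(backoff_delays(3, 0))  # [0, 0, 0]
```

Under this assumption, leaving --backoff at 0 while raising --max_retries retries without any pause, which may not help against rate-limited sites.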

Executing

Passing URLs from the terminal

newspaper3kli https://hello.world/article/2020 \
    https://hello.world/article/2019

Reading from a txt file

TXT is the simplest file format for reading with Newspaper3kli.

Given a txt file with the following content (line-delimited URLs):

https://hello.world/article/2020
https://hello.world/article/2019

pipe it to newspaper3kli with:

cat /path/to/this/file.txt | newspaper3kli
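
URL lists collected by hand often contain blank lines or duplicates. A minimal sketch of a pre-filter that could sit in front of the pipe is shown here; clean_urls is a hypothetical helper, not something shipped with Newspaper3kli, and skipping lines that start with # is an assumption about how such a file might be annotated.

```python
from typing import Iterable, Iterator


def clean_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield unique, non-empty URLs in their original order."""
    seen = set()
    for line in lines:
        url = line.strip()
        # Skip blanks, comment lines (assumed convention) and repeats.
        if not url or url.startswith("#") or url in seen:
            continue
        seen.add(url)
        yield url


sample = [
    "https://hello.world/article/2020\n",
    "\n",
    "https://hello.world/article/2019\n",
    "https://hello.world/article/2020\n",
]
for url in clean_urls(sample):
    print(url)
```

In a real pipeline the same function could be fed sys.stdin instead of the sample list, with the printed lines piped on to newspaper3kli.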

Reading from a CSV file

CSV parsing depends on a tool such as awk or cut to split the columns.

Content sample

url,tags,date
https://hello.world/article/2020,some|thing,2020-01-01T00:00:00
https://hello.world/article/2019,some|thing,2019-01-01T00:00:00

Processing

# $1 is the position of the URL column; change it to match your file
# NR>1 skips the header row
cat /path/to/this/file.csv | awk -F, 'NR>1 { print $1 }' | newspaper3kli

For any other character-delimited content, simply change -F, (comma) to the desired delimiter, e.g. -F'\t' for TSV.
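
Splitting on every comma breaks when a field itself contains a quoted comma. As an alternative sketch, Python's csv module handles quoting; this assumes the file has a header row with a column named url, as in the sample above, and extract_urls is a hypothetical helper name.

```python
import csv
import io
from typing import Iterable, List


def extract_urls(csv_lines: Iterable[str], column: str = "url") -> List[str]:
    """Return the non-empty values of one column from CSV text with a header."""
    reader = csv.DictReader(csv_lines)
    return [row[column] for row in reader if row.get(column)]


sample = io.StringIO(
    "url,tags,date\n"
    "https://hello.world/article/2020,some|thing,2020-01-01T00:00:00\n"
    'https://hello.world/article/2019,"some,thing",2019-01-01T00:00:00\n'
)
for url in extract_urls(sample):
    print(url)
```

The printed URLs can then be piped to newspaper3kli in the same way as the awk output.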

Output path

When no path is given through the --output parameter, results are written to a directory named output inside Newspaper3kli's directory. Files are named after the article and stored in pairs:

  • JSON for metadata;
  • HTML for content.
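
The JSON/HTML pairs can be read back for further processing. The sketch below assumes that both files of a pair share the same base name and differ only in their .json and .html extensions, which the description above suggests but does not state; iter_articles is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any, Iterator, Tuple


def iter_articles(output_dir: str = "output") -> Iterator[Tuple[Any, str]]:
    """Yield (metadata, html) for every JSON file that has a matching HTML file."""
    for json_path in sorted(Path(output_dir).glob("*.json")):
        html_path = json_path.with_suffix(".html")
        if not html_path.exists():
            # Incomplete pair (assumed possible after a failed download).
            continue
        metadata = json.loads(json_path.read_text(encoding="utf-8"))
        html = html_path.read_text(encoding="utf-8")
        yield metadata, html
```

Calling iter_articles("/path/given/to/--output") would then walk the stored articles one pair at a time without loading everything into memory.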

Credits

Thanks to dsynkov for the work on newspaper-bulk, the source of inspiration and of some code for this project.

Download files

Source Distribution

newspaper3kli-0.1.1.tar.gz (5.9 kB)

Uploaded Source

Built Distribution

newspaper3kli-0.1.1-py3-none-any.whl (6.1 kB)

Uploaded Python 3

File details

Details for the file newspaper3kli-0.1.1.tar.gz.

File metadata

  • Download URL: newspaper3kli-0.1.1.tar.gz
  • Size: 5.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.8.3

File hashes

Hashes for newspaper3kli-0.1.1.tar.gz
Algorithm Hash digest
SHA256 1c5b73f8d6b5ea424d7fbafc798996fc1c6791094a1a63af48a60621b7a921c1
MD5 5e4c122ccb647f8cbec05a880032982f
BLAKE2b-256 7f25a1ac781a1643a625bdbee4a9959ec579d0c5c280dc0e013cb4efffdce490
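
A downloaded file can be checked against the SHA256 digest listed above. This is a minimal sketch using hashlib from the standard library; the helper names and the local file location are assumptions.

```python
import hashlib

# SHA256 digest listed above for newspaper3kli-0.1.1.tar.gz.
EXPECTED_SHA256 = "1c5b73f8d6b5ea424d7fbafc798996fc1c6791094a1a63af48a60621b7a921c1"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's digest with the expected one, ignoring letter case."""
    return sha256_of(path) == expected.lower()
```

For example, matches_expected("newspaper3kli-0.1.1.tar.gz") should return True for an intact copy of the source distribution saved in the current directory.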

File details

Details for the file newspaper3kli-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: newspaper3kli-0.1.1-py3-none-any.whl
  • Size: 6.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.8.3

File hashes

Hashes for newspaper3kli-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 6d212ec4304df4644a34ef738754c9b7fc251aa1eacde32728276500fae70cea
MD5 2dbb062d6230b4ef0c690362243a5b9c
BLAKE2b-256 13f3956827e5accf7983d36f7c0c6fc6a8810419b728193fe341ad48c5be713e
