Skip to main content

Simplified python article discovery & extraction.

Project description

Newspaper4k: Article scraping & curation, a continuation of the beloved newspaper3k by codelucas

image

Build status

Coverage status

At the moment the Newspaper4k Project is a fork of the well known newspaper3k by codelucas which was not updated since Sept 2020. The initial goal of this fork is to keep the project alive and to add new features and fix bugs.

I have duplicated all issues on the original project and will try to fix them. If you have any issues or feature requests please open an issue here.

Python compatibility

- Recommended: Python 3.8+
- Python 3.6+ minimum
- Fixes for Python < 3.8 are low priority and might not be merged

Quick start

pip install newspaper4k
from newspaper import Article

article = Article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')
article.download()
article.parse()

print(article.authors)
# ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']

print(article.publish_date)
# 2023-10-29 09:00:15.717000+00:00

print(article.text)
# New England Patriots head coach Bill Belichick, right, embraces Buffalo Bills head coach Sean McDermott ...

print(article.top_image)
#https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill

print(article.movies)
# []

article.nlp()
print(article.keywords)
# ['broncos', 'game', 'et', 'wide', 'chiefs', 'mahomes', 'patrick', 'denver', 'nfl', 'stadium', 'week', 'quarterback', 'win', 'history', 'images']

print(article.summary)
# Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
# Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
# Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
# The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
# Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime

Using the builder API

import newspaper

cnn_paper = newspaper.build('http://cnn.com')
print(cnn_paper.category_urls())
# ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com', 'https://cnnespanol.cnn.com', 'http://edition.cnn.com', 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']

article_urls = [article.url for article in cnn_paper.articles]
print(article_urls[:3])
# ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson', 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations', 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']

article = cnn_paper.articles[0]
article.download()
article.parse()

print(article.title)
# المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى

``` pycon
from newspaper import fulltext

html = requests.get(...).text
text = fulltext(html)

Newspaper can extract and detect languages seamlessly. If no language is specified, Newspaper will attempt to auto detect a language.

from newspaper import Article

article = Article('https://www.bbc.com/zhongwen/simp/chinese-news-67084358', language='zh')
a.download()
a.parse()

print(a.title)
# 晶片大战:台湾厂商助攻华为突破美国封锁?

Docs

Check out The Docs for full and detailed guides using newspaper.

Contributing

Adding languages

Interested in adding a new language for us? Refer to: Docs - Adding new languages

Submitting a PR

Interested in submitting a PR? Refer to: Docs - Submitting a PR

Submitting an issue

Before submitting an issue, please check if it has already been reported. Additionally, please check that:

Also, in any case, please provide the following information:

  • The URL of the article you are trying to parse
  • The code you are using to parse the article
  • The error you are getting (if any)
  • The parsing result you are getting (if any)

Features

  • Multi-threaded article download framework
  • News url identification
  • Text extraction from html
  • Top image extraction from html
  • All image extraction from html
  • Keyword extraction from text
  • Summary extraction from text
  • Author extraction from text
  • Google trending terms extraction
  • Works in 10+ languages (English, Chinese, German, Arabic, ...)

Requirements and dependencies

If you are on Debian / Ubuntu, install using the following:

  • Install pip3 command needed to install newspaper3k package:

    $ sudo apt-get install python3-pip
    
  • Python development version, needed for Python.h:

    $ sudo apt-get install python-dev
    
  • lxml requirements:

    $ sudo apt-get install libxml2-dev libxslt-dev
    
  • For PIL to recognize .jpg images:

    $ sudo apt-get install libjpeg-dev zlib1g-dev libpng12-dev
    

NOTE: If you find problem installing libpng12-dev, try installing libpng-dev.

  • Download NLP related corpora:

    $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
    
  • Install the distribution via pip:

    $ pip3 install newspaper3k
    

If you are on OSX, install using the following, you may use both homebrew or macports:

$ brew install libxml2 libxslt

$ brew install libtiff libjpeg webp little-cms2

$ pip3 install newspaper3k

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

Otherwise, install with the following:

NOTE: You will still most likely need to install the following libraries via your package manager

  • PIL: libjpeg-dev zlib1g-dev libpng12-dev
  • lxml: libxml2-dev libxslt-dev
  • Python Development version: python-dev
<!-- -->
$ pip3 install newspaper3k

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

LICENSE

Authored and maintained by Andrei Paraschiv.

Newspaper was originally developed by Lucas Ou-Yang (codelucas), the original repository can be found here. Newspaper is licensed under the MIT license.

Credits

Thanks to Lucas Ou-Yang for creating the original Newspaper3k project and to all contributors of the original project.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

newspaper4k-0.9.0.tar.gz (206.1 kB view details)

Uploaded Source

Built Distribution

newspaper4k-0.9.0-py3-none-any.whl (214.6 kB view details)

Uploaded Python 3

File details

Details for the file newspaper4k-0.9.0.tar.gz.

File metadata

  • Download URL: newspaper4k-0.9.0.tar.gz
  • Upload date:
  • Size: 206.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.9.18 Linux/6.3.13-060313-generic

File hashes

Hashes for newspaper4k-0.9.0.tar.gz
Algorithm Hash digest
SHA256 f56329a0f9d2ccc173bbe78223dff87949fa6ca1bd993362d0b312228d1f96a4
MD5 73ca1628dade51b86920b44a6c95a1a3
BLAKE2b-256 3240eb8a0e7416ff4a0990948754e1ee4f79097d4ba7b80871d71b9fdb12291e

See more details on using hashes here.

Provenance

File details

Details for the file newspaper4k-0.9.0-py3-none-any.whl.

File metadata

  • Download URL: newspaper4k-0.9.0-py3-none-any.whl
  • Upload date:
  • Size: 214.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.9.18 Linux/6.3.13-060313-generic

File hashes

Hashes for newspaper4k-0.9.0-py3-none-any.whl
Algorithm Hash digest
SHA256 faf4e64ddc73f87696ae5d4f791dada91381bc254928eda3deff7761c0b63a89
MD5 2b775dc457ba980c401a856569ca022c
BLAKE2b-256 323c284b59f66113aa57528a918341781229b1fabb0ef53c94876ec53ad53304

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page