
Simplified Python article discovery & extraction.

Project description


Inspired by requests for its simplicity and powered by lxml for its speed, newspaper is a Python 2 library for extracting & curating articles from the web.

Newspaper wants to change the way people handle article extraction with a new, more precise layer of abstraction. Newspaper caches whatever it can for speed. Also, everything is in unicode.

Please refer to The Documentation for a quickstart tutorial!

A Glance:

>>> import newspaper

>>> cnn_paper = newspaper.build('http://cnn.com')

>>> for article in cnn_paper.articles:
...     print article.url
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...

>>> for category in cnn_paper.category_urls():
...     print category
http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...
>>> article = cnn_paper.articles[0]
>>> article.download()

>>> article.html
u'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()

>>> article.authors
[u'Leigh Ann Caldwell', u'John Honway']

>>> article.text
u"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
>>> article.nlp()

>>> article.keywords
['New Years', 'resolution', ...]

>>> article.summary
u'The study shows that 93% of people ...'
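
One way to picture the article.keywords result above is as a word-frequency ranking. The block below is a minimal standalone sketch in Python 3, not newspaper's actual algorithm. The top_keywords helper and the STOPWORDS set are hypothetical names introduced here for illustration only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be far longer.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "that", "not"}


def top_keywords(text, limit=3):
    """Return up to `limit` most frequent non-stopword tokens, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


sample = "The resolution study shows that a resolution is not kept. The study is new."
print(top_keywords(sample, limit=2))
```

A frequency count ignores word position and phrasing, so it should be read only as a rough mental model of what a keyword extractor returns.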

Documentation

Check out The Documentation for full and detailed guides using newspaper.

Features

  • News url identification

  • Text extraction from html

  • Keyword extraction from text

  • Summary extraction from text

  • Author extraction from text

  • Top image extraction from html

  • All image extraction from html

  • Multi-threaded article download framework

  • Google trending terms extraction
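
The multi-threaded download feature listed above can be pictured with a thread pool. This is a minimal standalone sketch in Python 3 using only the standard library. It does not use newspaper's own framework, and fetch is a hypothetical stand-in that returns fake HTML so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a real HTTP download; returns fake HTML for the URL."""
    return "<html><title>%s</title></html>" % url


def download_all(urls, max_workers=4):
    """Download every URL on a thread pool and map each URL to its body."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(fetch, urls)))


urls = ["http://cnn.com/world", "http://tech.cnn.com"]
pages = download_all(urls)
print(sorted(pages))
```

Threads suit this workload because downloads spend most of their time waiting on the network rather than on the CPU.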

Get it now

$ pip install newspaper

IMPORTANT
If you plan to use the natural language features, nlp(),
you must first download the separate nltk corpora with the command below.
Run the download with Python 2.6 or 2.7!

$ curl https://raw.github.com/codelucas/newspaper/master/download_corpora.py | python2.7

Todo List

  • Add a “follow_robots.txt” option in the config object.

  • Bake in the CSSSelect and BeautifulSoup dependencies
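
As a rough idea of what a robots.txt check from the first item could involve, here is a minimal standalone sketch in Python 3 built on the standard library's urllib.robotparser. It is not part of newspaper. The allowed helper, the example-bot user agent, and the ROBOTS_TXT rules are hypothetical, and the rules are parsed in memory so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical rules for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, user_agent="example-bot"):
    """Return True if the example rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.com/world/story.html"))
print(allowed("http://example.com/private/draft.html"))
```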

0.0.4 - Fully integrated the python-goose library into newspaper. Article objects now have many more options. All configuration is now based on Configuration() objects, which can be passed into Source or Article objects. Default configuration setups make this easy. Added a simple multithreaded article download framework.
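
The pattern described in this entry, a configuration object that is passed in and falls back to defaults when omitted, can be sketched as follows. The SketchConfig and SketchArticle classes and their attribute names (request_timeout, keep_html) are hypothetical and do not reflect newspaper's real Configuration() fields.

```python
class SketchConfig(object):
    """Hypothetical settings holder; attribute names are illustrative only."""

    def __init__(self, request_timeout=7, keep_html=True):
        self.request_timeout = request_timeout
        self.keep_html = keep_html


class SketchArticle(object):
    """Hypothetical consumer that falls back to defaults when no config is given."""

    def __init__(self, url, config=None):
        self.url = url
        # Build a fresh default config per instance instead of sharing one object.
        self.config = config if config is not None else SketchConfig()


default_article = SketchArticle("http://cnn.com/world")
custom_article = SketchArticle("http://cnn.com/world", SketchConfig(request_timeout=2))
print(default_article.config.request_timeout, custom_article.config.request_timeout)
```

Keeping every option on one object means a caller can override a single setting without repeating the rest.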
