Simplified python article discovery & extraction.
Newspaper aims to change the way people handle article extraction with a new, more precise layer of abstraction. It caches whatever it can for speed, and everything is in Unicode.
Please refer to The Documentation for a quickstart tutorial!
>>> import newspaper
>>> cnn_paper = newspaper.build('http://cnn.com')

>>> for article in cnn_paper.articles:
>>>     print article.url
u'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
u'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
...

>>> for category in cnn_paper.category_urls():
>>>     print category
u'http://lifestyle.cnn.com'
u'http://cnn.com/world'
u'http://tech.cnn.com'
...
>>> article = cnn_paper.articles[0]
>>> article.download()
>>> article.html
u'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> article.parse()
>>> article.authors
[u'Leigh Ann Caldwell', 'John Honway']
>>> article.text
u'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'

>>> article.nlp()
>>> article.keywords
['New Years', 'resolution', ...]
>>> article.summary
u'The study shows that 93% of people ...'
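The download, parse, and nlp steps shown above can be wrapped in a single helper. The following is a minimal sketch written for Python 3. It relies only on the method and attribute names that appear in the session (`download()`, `parse()`, `nlp()`, `url`, `authors`, `text`, `keywords`, `summary`); the helper name `process_article` and its `use_nlp` flag are hypothetical and not part of newspaper.

```python
def process_article(article, use_nlp=False):
    """Run download -> parse -> (optional) nlp on one article-like object.

    Hypothetical convenience wrapper, not part of newspaper's API.
    `article` is assumed to expose the methods and attributes shown
    in the interactive session above.
    """
    article.download()  # fetch the raw HTML
    article.parse()     # extract authors and body text from the HTML

    result = {
        "url": article.url,
        "authors": list(article.authors),
        "text": article.text,
    }

    if use_nlp:
        # nlp() depends on the separately downloaded nltk corpora
        article.nlp()
        result["keywords"] = list(article.keywords)
        result["summary"] = article.summary

    return result
```

Keeping `use_nlp` off by default means the helper still works when the nltk corpora have not been installed.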
Check out The Documentation for full and detailed guides using newspaper.
- News url identification
- Text extraction from html
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Top image extraction from html
- All image extraction from html
- Multi-threaded article download framework
- Google trending terms extraction
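To illustrate the idea behind the multi-threaded download feature listed above, here is a minimal Python 3 sketch built on the standard library's `concurrent.futures`. It is not newspaper's own framework: the function name `download_all`, the injected `fetch` callable, and the `max_workers` parameter are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return a {url: result} dict.

    `fetch` is any callable that takes a URL and returns its content
    (hypothetical; supplied by the caller). If a fetch raises, the
    exception object is stored as that URL's result, so one failure
    does not abort the whole batch.
    """
    urls = list(urls)  # allow any iterable; it is traversed twice below

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as exc:
            return exc  # record the failure instead of propagating it

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly
        results = list(pool.map(safe_fetch, urls))

    return dict(zip(urls, results))
```

Passing the fetch function in as an argument keeps the sketch independent of any particular HTTP client and makes it easy to exercise with a fake fetcher.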
Get it now
$ pip install newspaper

IMPORTANT: If you know for sure that you'll use the natural language features, nlp(), you must download some separate nltk corpora below. You must download everything in python 2.6 - 2.7!

$ curl https://raw.github.com/codelucas/newspaper/master/download_corpora.py | python2.7
- Add a “follow_robots.txt” option in the config object.
- Bake in the CSSSelect and BeautifulSoup dependencies
- 0.0.4 - Fully integrated the python-goose library into newspaper. Article objects now have many more options. All configurations are now based on Configuration() objects, which can be passed into Source or Article objects. Default configuration setups make this easy. Added a simple multithreaded article download framework.
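The “follow_robots.txt” option proposed in the to-do items above could be backed by a check like the one below. This is only a sketch of one possible approach, written for Python 3 with the standard library's `urllib.robotparser`; it does not describe how newspaper implements or will implement the option. The function name `is_allowed` and the parameters `robots_txt`, `user_agent`, and `follow_robots_txt` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(url, robots_txt, user_agent="*", follow_robots_txt=True):
    """Return True if `url` may be fetched under the given robots.txt text.

    Hypothetical helper: `follow_robots_txt` mirrors the proposed config
    option, and `robots_txt` is the already-downloaded file content.
    """
    if not follow_robots_txt:
        return True  # option switched off: never block a URL

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from in-memory text
    return parser.can_fetch(user_agent, url)
```

For example, with the rules `User-agent: *` and `Disallow: /private/`, a URL under `/private/` is rejected while the option is on and accepted once `follow_robots_txt` is set to `False`.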
|Filename, size||File type||Python version|
|newspaper-0.0.4.macosx-10.8-intel.exe (6.9 MB)||Windows Installer||any|
|newspaper-0.0.4.tar.gz (6.7 MB)||Source||None|