A robust parser that extracts the title, content, and images from HTML pages.
Project description
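# Python 2 example (uses urllib2 and print statements)
import urllib2

from jparser import PageModel

# Fetch the page and decode the raw bytes as GB18030
html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')

# Build the page model and run the extraction
pm = PageModel(html)
result = pm.extract()

print "==title=="
print result['title']

print "==content=="
for x in result['content']:
    # Each content item is either a text block or an image block
    if x['type'] == 'text':
        print x['data']
    if x['type'] == 'image':
        print "[IMAGE]", x['data']['src']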
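Judging from the example above, the value returned by extract() behaves like a dictionary with a 'title' entry and a 'content' list whose items carry a 'type' of 'text' or 'image'. Under that assumption, the following minimal sketch separates the text paragraphs from the image URLs. The helper name split_content and the sample data are hypothetical and are not part of jparser; the sample is hand-written so the snippet runs without a network connection or the library installed.

```python
def split_content(result):
    """Split an extract()-style result into text paragraphs and image URLs.

    Assumes the structure used in the example above: result['content'] is a
    list of dicts with 'type' equal to 'text' or 'image'.
    """
    paragraphs = []
    image_urls = []
    for block in result.get('content', []):
        if block.get('type') == 'text':
            paragraphs.append(block['data'])
        elif block.get('type') == 'image':
            image_urls.append(block['data']['src'])
    return paragraphs, image_urls


# Hand-written sample that mimics the assumed result structure (not real output).
sample = {
    'title': 'Example title',
    'content': [
        {'type': 'text', 'data': 'First paragraph.'},
        {'type': 'image', 'data': {'src': 'http://example.com/a.jpg'}},
        {'type': 'text', 'data': 'Second paragraph.'},
    ],
}

paragraphs, image_urls = split_content(sample)
```

With a real page, the same helper could be applied to the result of pm.extract(), provided the returned structure matches the one assumed here.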
Download files
Source distribution: jparser-0.0.4.tar.gz (2.9 kB)