A readability-style parser that extracts the title, main content, and images from HTML pages.
Project description
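    # Python 2 example (urllib2 and the print statement are Python 2 only)
    import urllib2
    from jparser import PageModel

    # Fetch the page and decode it from its GB18030 encoding
    html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')

    # Build the page model and run the extraction
    pm = PageModel(html)
    result = pm.extract()

    print "==title=="
    print result['title']
    print "==content=="
    # Each content item is either a text block or an image
    for x in result['content']:
        if x['type'] == 'text':
            print x['data']
        if x['type'] == 'image':
            print "[IMAGE]", x['data']['src']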
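The example above treats the result as a dict with a `title` string and a `content` list whose items are either `text` or `image`. As a minimal sketch under that assumption, the same walk can be wrapped in a helper that returns a string instead of printing. The name `render_result` and the `sample` data are hypothetical and are not part of jparser; the sample is hand-written, not real extraction output.

```python
def render_result(result):
    """Flatten an extract()-style result dict into plain text lines."""
    lines = ["==title==", result["title"], "==content=="]
    for block in result["content"]:
        if block["type"] == "text":
            # Text blocks carry their string directly in 'data'
            lines.append(block["data"])
        elif block["type"] == "image":
            # Image blocks carry a dict; 'src' holds the image URL
            lines.append("[IMAGE] " + block["data"]["src"])
    return "\n".join(lines)


# Hypothetical sample shaped like the result in the example above.
sample = {
    "title": "Example headline",
    "content": [
        {"type": "text", "data": "First paragraph."},
        {"type": "image", "data": {"src": "http://example.com/a.jpg"}},
    ],
}

print(render_result(sample))
```

Returning a string rather than printing keeps the helper easy to test and lets the caller decide where the text goes.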
Download files
Source Distribution
jparser-0.0.20.tar.gz (3.5 kB)
File details
Details for the file jparser-0.0.20.tar.gz.
File metadata
- Download URL: jparser-0.0.20.tar.gz
- Upload date:
- Size: 3.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | c6b3c6ff5cc20c615f4b097c4f1b765495a315790110a5032f694b72ac6b392b |
MD5 | 4d6a655aaf14a49e0b61c8a2692b8ea3 |
BLAKE2b-256 | 78fea080447f4058c0961d8db205f278d8fc4f623bdb581a5c56e750012af3a4 |
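
The SHA256 digest listed above can be checked against a downloaded copy of the archive. The following is a minimal sketch using only the standard-library `hashlib` module; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local file path is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest copied from the file hashes listed above.
EXPECTED_SHA256 = "c6b3c6ff5cc20c615f4b097c4f1b765495a315790110a5032f694b72ac6b392b"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("jparser-0.0.20.tar.gz")` should return `True` for an intact download saved under that name in the current directory.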